Our experiment aims to analyze how gene expression patterns in cells are affected by different oxygen environments, specifically normoxia (normal oxygen levels) and hypoxia (reduced oxygen levels). Understanding the impact of oxygen availability on gene expression is crucial, as it plays a fundamental role in various biological processes, including cellular metabolism, development, and disease progression. By investigating the changes in gene expression under normoxic and hypoxic conditions, we can gain insights into the molecular mechanisms that cells employ to adapt and survive in low oxygen environments.
To achieve this, we will utilize two advanced sequencing methods: Smart-Seq and Drop-Seq. These methods capture the gene expression profiles of individual cells at high resolution, allowing us to examine the heterogeneity within cell populations and identify subtle transcriptional changes induced by oxygen levels. By applying these techniques to our two selected cell lines, HCC1806 and MCF7, we aim to investigate the specific responses of these cancer cells to changes in oxygen availability.
HCC1806 is derived from an acantholytic squamous carcinoma of the breast, a triple-negative breast cancer. This aggressive malignancy is characterized by uncontrolled growth and the ability to invade surrounding tissues. Understanding the hypoxia-associated alterations in gene expression in HCC1806 cells is of great importance, as hypoxia is a common feature of the tumor microenvironment and has been linked to tumor progression, metastasis, and resistance to therapy.
On the other hand, MCF7 is a widely studied cell line that originates from human breast adenocarcinoma. Breast cancer is a complex disease with diverse subtypes and variable responses to treatment. Investigating the influence of oxygen levels on the gene expression profiles of MCF7 cells can provide valuable insights into the adaptive mechanisms of breast cancer cells under hypoxic conditions. This knowledge may contribute to the development of novel therapeutic strategies targeting hypoxia-related pathways in breast cancer.
The data provided for our analysis is structured as .csv tables, with each column representing a single sequenced cell. The cell is identified by a specific name that includes information about its growth condition (normoxia or hypoxia). Each row in the table corresponds to a gene, identified by its unique gene symbol. This structured data format allows us to efficiently analyze and compare the gene expression levels across different cells and conditions.
We will perform exploratory data analysis (EDA), followed by unsupervised and supervised learning. Our project aims to unravel the transcriptional changes associated with normoxia and hypoxia in the HCC1806 and MCF7 cell lines.
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import random
import sys
import sklearn
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.decomposition import PCA
We read the metadata file:
data_meta = pd.read_csv("/Users/ela/Documents/AI_LAB/SmartSeq/MCF7_SmartS_MetaData.tsv",delimiter="\t",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(data_meta))
print("First column: ", data_meta.iloc[ : , 0])
Dataframe dimensions: (383, 8)
First column: Filename
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam MCF7
output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam MCF7
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam MCF7
output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam MCF7
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam MCF7
...
output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam MCF7
output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam MCF7
output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam MCF7
output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam MCF7
output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam MCF7
Name: Cell Line, Length: 383, dtype: object
Let's verify that there are no duplicate cell names in the data_meta dataset:
names = [i for i in data_meta["Cell name"]]
assert len(names) == len(set(names))
data_meta.head()
| Cell Line | Lane | Pos | Condition | Hours | Cell name | PreprocessingTag | ProcessingComments | |
|---|---|---|---|---|---|---|---|---|
| Filename | ||||||||
| output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A10 | Hypo | 72 | S28 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A11 | Hypo | 72 | S29 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A12 | Hypo | 72 | S30 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A1 | Norm | 72 | S1 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A2 | Norm | 72 | S2 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
Each row represents a cell; the columns are summarized below:
data_meta.describe(include='all')
| Cell Line | Lane | Pos | Condition | Hours | Cell name | PreprocessingTag | ProcessingComments | |
|---|---|---|---|---|---|---|---|---|
| count | 383 | 383 | 383 | 383 | 383.0 | 383 | 383 | 383 |
| unique | 1 | 4 | 98 | 2 | NaN | 383 | 1 | 1 |
| top | MCF7 | output.STAR.1 | A10 | Norm | NaN | S28 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| freq | 383 | 96 | 4 | 192 | NaN | 1 | 383 | 383 |
| mean | NaN | NaN | NaN | NaN | 72.0 | NaN | NaN | NaN |
| std | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN |
| min | NaN | NaN | NaN | NaN | 72.0 | NaN | NaN | NaN |
| 25% | NaN | NaN | NaN | NaN | 72.0 | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | 72.0 | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | 72.0 | NaN | NaN | NaN |
| max | NaN | NaN | NaN | NaN | 72.0 | NaN | NaN | NaN |
print(data_meta.isnull().sum())
for i in data_meta.isnull().sum():
assert i == 0
Cell Line             0
Lane                  0
Pos                   0
Condition             0
Hours                 0
Cell name             0
PreprocessingTag      0
ProcessingComments    0
dtype: int64
There are no missing values.
Repeating the same steps for HCC1806 SmartSeq experiment (so the same experiment on another cell line), we obtain a similar result but with dataframe dimensions = (243, 8): we have 243 cells with no duplicates and no missing values in the table.
We read the file with the MCF7 SmartSeq experiment dataset:
data = pd.read_csv("/Users/ela/Documents/AI_LAB/SmartSeq/MCF7_SmartS_Unfiltered_Data.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(data))
print("First column: ", data.iloc[ : , 0])
Dataframe dimensions: (22934, 383)
First column: "WASH7P" 0
"MIR6859-1" 0
"WASH9P" 1
"OR4F29" 0
"MTND1P23" 0
...
"MT-TE" 4
"MT-CYB" 270
"MT-TT" 0
"MT-TP" 5
"MAFIP" 8
Name: "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam", Length: 22934, dtype: int64
We transpose the original dataframe to have the cells in the rows and the genes as features in the columns. We also remove the double quotes from the feature names to simplify indexing.
def remove_double_quotes(word):
return word.replace('"', '')
data = data.rename(columns={"{}".format(i):"{}".format(remove_double_quotes(i)) for i in data.columns})
data = data.T
data = data.rename(columns={"{}".format(i):"{}".format(remove_double_quotes(i)) for i in data.columns})
print("Dataframe dimensions:", np.shape(data))
Dataframe dimensions: (383, 22934)
HCC1806 SmartSeq experiment: we have a dataframe of dimensions (243, 23396).
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 383 entries, output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam to output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam
Columns: 22934 entries, WASH7P to MAFIP
dtypes: int64(22934)
memory usage: 67.0+ MB
data.head()
| WASH7P | MIR6859-1 | WASH9P | OR4F29 | MTND1P23 | MTND2P28 | MTCO1P12 | MTCO2P12 | MTATP8P1 | MTATP6P1 | ... | MT-TH | MT-TS2 | MT-TL2 | MT-ND5 | MT-ND6 | MT-TE | MT-CYB | MT-TT | MT-TP | MAFIP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam | 0 | 0 | 1 | 0 | 0 | 2 | 2 | 0 | 0 | 29 | ... | 0 | 0 | 0 | 505 | 147 | 4 | 270 | 0 | 5 | 8 |
| output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 12 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 76 | 0 | 0 | 0 |
| output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | ... | 1 | 0 | 0 | 44 | 8 | 0 | 66 | 0 | 1 | 0 |
| output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 68 | ... | 0 | 0 | 0 | 237 | 31 | 3 | 727 | 0 | 0 | 0 |
5 rows × 22934 columns
print(data.dtypes)
for i in data.dtypes:
assert i == "int64"
WASH7P int64
MIR6859-1 int64
WASH9P int64
OR4F29 int64
MTND1P23 int64
...
MT-TE int64
MT-CYB int64
MT-TT int64
MT-TP int64
MAFIP int64
Length: 22934, dtype: object
numeric_columns = data.select_dtypes(include=[np.number]).columns
all_numeric = len(numeric_columns) == len(data.columns)
print(all_numeric)
True
All columns are numerical; there are no categorical features.
desc_table = data.describe()
desc_table
| WASH7P | MIR6859-1 | WASH9P | OR4F29 | MTND1P23 | MTND2P28 | MTCO1P12 | MTCO2P12 | MTATP8P1 | MTATP6P1 | ... | MT-TH | MT-TS2 | MT-TL2 | MT-ND5 | MT-ND6 | MT-TE | MT-CYB | MT-TT | MT-TP | MAFIP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 383.000000 | 383.000000 | 383.000000 | 383.00000 | 383.000000 | 383.000000 | 383.000000 | 383.000000 | 383.000000 | 383.000000 | ... | 383.000000 | 383.000000 | 383.000000 | 383.000000 | 383.000000 | 383.000000 | 383.00000 | 383.000000 | 383.000000 | 383.000000 |
| mean | 0.133159 | 0.026110 | 1.344648 | 0.05483 | 0.049608 | 6.261097 | 4.681462 | 0.524804 | 0.073107 | 222.054830 | ... | 1.060052 | 0.443864 | 3.146214 | 1016.477807 | 204.600522 | 5.049608 | 2374.97389 | 2.083551 | 5.626632 | 1.749347 |
| std | 0.618664 | 0.249286 | 2.244543 | 0.31477 | 0.229143 | 7.565749 | 6.232649 | 0.980857 | 0.298131 | 262.616874 | ... | 1.990566 | 1.090827 | 4.265352 | 1009.444811 | 220.781927 | 6.644302 | 2920.39000 | 3.372714 | 7.511180 | 3.895204 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 23.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 172.000000 | 30.500000 | 0.000000 | 216.50000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 | 98.000000 | ... | 0.000000 | 0.000000 | 1.000000 | 837.000000 | 152.000000 | 3.000000 | 785.00000 | 0.000000 | 3.000000 | 0.000000 |
| 75% | 0.000000 | 0.000000 | 2.000000 | 0.00000 | 0.000000 | 10.000000 | 7.000000 | 1.000000 | 0.000000 | 370.500000 | ... | 1.000000 | 0.000000 | 5.000000 | 1549.000000 | 294.000000 | 7.000000 | 4059.00000 | 3.000000 | 8.000000 | 2.000000 |
| max | 9.000000 | 4.000000 | 20.000000 | 3.00000 | 2.000000 | 45.000000 | 36.000000 | 6.000000 | 2.000000 | 1662.000000 | ... | 15.000000 | 8.000000 | 22.000000 | 8115.000000 | 2002.000000 | 46.000000 | 16026.00000 | 22.000000 | 56.000000 | 32.000000 |
8 rows × 22934 columns
print("Global max is:", desc_table.loc["max"].max())
print("Global min is:", desc_table.loc["min"].min())
Global max is: 190556.0
Global min is: 0.0
At first glance, the dataset contains a large number of zero entries and some very large values (potential outliers). We will deal with sparsity and outliers in the next sections.
print(data.isnull().sum())
for i in data.isnull().sum():
assert i == 0
WASH7P 0
MIR6859-1 0
WASH9P 0
OR4F29 0
MTND1P23 0
..
MT-TE 0
MT-CYB 0
MT-TT 0
MT-TP 0
MAFIP 0
Length: 22934, dtype: int64
There are no missing values.
HCC1806 SmartSeq experiment: same results but with 210944.0 as global max.
The duplicated() function in pandas identifies duplicate rows in a DataFrame or Series. To see which genes are redundant, we apply data.T.duplicated(), which returns a boolean Series indicating which rows of data.T (i.e., columns of data) are duplicated.
duplicate_data = data.T[data.T.duplicated()]
print("Number of duplicate genes:", duplicate_data.shape[0], "over", data.T.shape[0])
print("Percentage of duplicate genes:", (duplicate_data.shape[0] * 100) / (data.T.shape[0]), "%")
Number of duplicate genes: 29 over 22934
Percentage of duplicate genes: 0.12644981250545043 %
Since we have duplicate genes, we need to understand which ones are equal to each other. To do so, we compute the correlation matrix of the duplicate genes.
duplicate_rows_df_t = duplicate_data.T
duplicate_rows_df_t
c_dupl = duplicate_rows_df_t.corr()
c_dupl
| KLF2P3 | UGT1A9 | SLC22A14 | COQ10BP2 | LAP3P2 | GALNT17 | PON1 | MIR664B | KCNS2 | MIR548D1 | ... | RBFOX1 | ASPA | BCL6B | CCL3L1 | OTOP3 | RNA5SP450 | PSG1 | MIR3191 | SEZ6L | ADAMTS5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KLF2P3 | 1.000000 | -0.014798 | -0.008333 | -0.008333 | -0.032300 | -0.007903 | -0.007903 | -0.007142 | -0.008333 | -0.007903 | ... | -0.008333 | -0.007903 | -0.008333 | -0.012088 | -0.006928 | -0.008333 | -0.008333 | -0.008333 | -0.006775 | -0.007903 |
| UGT1A9 | -0.014798 | 1.000000 | -0.009322 | -0.009322 | -0.008675 | -0.008841 | -0.008841 | -0.007990 | -0.009322 | -0.008841 | ... | -0.009322 | -0.008841 | -0.009322 | -0.013523 | -0.007750 | -0.009322 | -0.009322 | -0.009322 | -0.007579 | -0.008841 |
| SLC22A14 | -0.008333 | -0.009322 | 1.000000 | 0.497375 | -0.020348 | 0.948434 | 0.948434 | -0.004499 | 0.497375 | -0.004979 | ... | 0.497375 | 0.630630 | 0.497375 | -0.007615 | 0.831379 | -0.005249 | 0.497375 | -0.005249 | 0.813013 | 0.630630 |
| COQ10BP2 | -0.008333 | -0.009322 | 0.497375 | 1.000000 | -0.020348 | 0.630630 | 0.630630 | -0.004499 | 0.497375 | -0.004979 | ... | 0.497375 | 0.630630 | 0.497375 | -0.007615 | 0.134926 | -0.005249 | 0.497375 | -0.005249 | 0.112487 | 0.630630 |
| LAP3P2 | -0.032300 | -0.008675 | -0.020348 | -0.020348 | 1.000000 | -0.019299 | -0.019299 | -0.017440 | -0.020348 | -0.019299 | ... | -0.020348 | -0.019299 | -0.020348 | -0.013474 | -0.016917 | -0.020348 | -0.020348 | 0.118817 | -0.016543 | -0.019299 |
| GALNT17 | -0.007903 | -0.008841 | 0.948434 | 0.630630 | -0.019299 | 1.000000 | 1.000000 | -0.004267 | 0.630630 | -0.004722 | ... | 0.630630 | 0.799056 | 0.630630 | -0.007222 | 0.612365 | -0.004979 | 0.630630 | -0.004979 | 0.586533 | 0.799056 |
| PON1 | -0.007903 | -0.008841 | 0.948434 | 0.630630 | -0.019299 | 1.000000 | 1.000000 | -0.004267 | 0.630630 | -0.004722 | ... | 0.630630 | 0.799056 | 0.630630 | -0.007222 | 0.612365 | -0.004979 | 0.630630 | -0.004979 | 0.586533 | 0.799056 |
| MIR664B | -0.007142 | -0.007990 | -0.004499 | -0.004499 | -0.017440 | -0.004267 | -0.004267 | 1.000000 | -0.004499 | -0.004267 | ... | -0.004499 | -0.004267 | -0.004499 | 0.007958 | -0.003741 | -0.004499 | -0.004499 | -0.004499 | -0.003658 | -0.004267 |
| KCNS2 | -0.008333 | -0.009322 | 0.497375 | 0.497375 | -0.020348 | 0.630630 | 0.630630 | -0.004499 | 1.000000 | -0.004979 | ... | 0.497375 | 0.630630 | 1.000000 | 0.021357 | 0.134926 | -0.005249 | 0.497375 | -0.005249 | 0.112487 | 0.948434 |
| MIR548D1 | -0.007903 | -0.008841 | -0.004979 | -0.004979 | -0.019299 | -0.004722 | -0.004722 | -0.004267 | -0.004979 | 1.000000 | ... | -0.004979 | -0.004722 | -0.004979 | -0.007222 | -0.004139 | -0.004979 | -0.004979 | -0.004979 | -0.004048 | -0.004722 |
| STRA6LP | 0.173251 | 0.031704 | -0.022850 | 0.001061 | 0.050458 | -0.021672 | -0.021672 | -0.015486 | -0.022850 | 0.069042 | ... | -0.022850 | -0.021672 | -0.022850 | 0.034205 | -0.018997 | -0.022850 | -0.022850 | 0.096707 | -0.018577 | -0.021672 |
| MUC6 | -0.007656 | -0.008565 | 0.654887 | 0.654887 | -0.018695 | 0.829681 | 0.829681 | -0.004134 | 0.654887 | -0.004574 | ... | 0.654887 | 0.829681 | 0.654887 | -0.006996 | 0.178813 | -0.004823 | 0.654887 | -0.004823 | 0.149322 | 0.829681 |
| LINC00595 | -0.012177 | -0.013622 | -0.007671 | -0.007671 | -0.029734 | -0.007275 | -0.007275 | -0.006575 | -0.007671 | -0.007275 | ... | -0.007671 | -0.007275 | -0.007671 | -0.011127 | -0.006377 | -0.007671 | -0.007671 | -0.007671 | -0.006237 | -0.007275 |
| CACYBPP1 | -0.008333 | -0.009322 | -0.005249 | -0.005249 | -0.020348 | -0.004979 | -0.004979 | -0.004499 | -0.005249 | -0.004979 | ... | -0.005249 | -0.004979 | -0.005249 | -0.007615 | -0.004364 | -0.005249 | -0.005249 | -0.005249 | -0.004268 | -0.004979 |
| KNOP1P1 | -0.011158 | -0.012483 | -0.007029 | -0.007029 | -0.027247 | -0.006667 | -0.006667 | -0.006025 | -0.007029 | -0.006667 | ... | -0.007029 | -0.006667 | -0.007029 | 0.031184 | -0.005844 | -0.007029 | -0.007029 | -0.007029 | -0.005715 | -0.006667 |
| WDR95P | -0.007903 | -0.008841 | 0.948434 | 0.312826 | -0.019299 | 0.799056 | 0.799056 | -0.004267 | 0.312826 | -0.004722 | ... | 0.312826 | 0.397167 | 0.312826 | -0.007222 | 0.964653 | -0.004979 | 0.312826 | -0.004979 | 0.955646 | 0.397167 |
| MIR19B1 | -0.007903 | -0.008841 | -0.004979 | -0.004979 | 0.156686 | -0.004722 | -0.004722 | -0.004267 | -0.004979 | -0.004722 | ... | -0.004979 | -0.004722 | -0.004979 | -0.007222 | -0.004139 | -0.004979 | -0.004979 | -0.004979 | -0.004048 | -0.004722 |
| RNU6-539P | -0.008333 | -0.009322 | -0.005249 | -0.005249 | -0.020348 | -0.004979 | -0.004979 | -0.004499 | -0.005249 | -0.004979 | ... | -0.005249 | -0.004979 | -0.005249 | -0.007615 | -0.004364 | -0.005249 | -0.005249 | -0.005249 | -0.004268 | -0.004979 |
| SNURF | -0.008333 | -0.009322 | -0.005249 | -0.005249 | -0.020348 | -0.004979 | -0.004979 | -0.004499 | -0.005249 | -0.004979 | ... | -0.005249 | -0.004979 | -0.005249 | -0.007615 | -0.004364 | -0.005249 | -0.005249 | -0.005249 | -0.004268 | -0.004979 |
| RBFOX1 | -0.008333 | -0.009322 | 0.497375 | 0.497375 | -0.020348 | 0.630630 | 0.630630 | -0.004499 | 0.497375 | -0.004979 | ... | 1.000000 | 0.630630 | 0.497375 | -0.007615 | 0.134926 | -0.005249 | 0.497375 | -0.005249 | 0.112487 | 0.630630 |
| ASPA | -0.007903 | -0.008841 | 0.630630 | 0.630630 | -0.019299 | 0.799056 | 0.799056 | -0.004267 | 0.630630 | -0.004722 | ... | 0.630630 | 1.000000 | 0.630630 | -0.007222 | 0.172005 | -0.004979 | 0.630630 | -0.004979 | 0.143597 | 0.799056 |
| BCL6B | -0.008333 | -0.009322 | 0.497375 | 0.497375 | -0.020348 | 0.630630 | 0.630630 | -0.004499 | 1.000000 | -0.004979 | ... | 0.497375 | 0.630630 | 1.000000 | 0.021357 | 0.134926 | -0.005249 | 0.497375 | -0.005249 | 0.112487 | 0.948434 |
| CCL3L1 | -0.012088 | -0.013523 | -0.007615 | -0.007615 | -0.013474 | -0.007222 | -0.007222 | 0.007958 | 0.021357 | -0.007222 | ... | -0.007615 | -0.007222 | 0.021357 | 1.000000 | -0.006331 | -0.007615 | -0.007615 | -0.007615 | -0.006191 | 0.011096 |
| OTOP3 | -0.006928 | -0.007750 | 0.831379 | 0.134926 | -0.016917 | 0.612365 | 0.612365 | -0.003741 | 0.134926 | -0.004139 | ... | 0.134926 | 0.172005 | 0.134926 | -0.006331 | 1.000000 | -0.004364 | 0.134926 | -0.004364 | 0.999479 | 0.172005 |
| RNA5SP450 | -0.008333 | -0.009322 | -0.005249 | -0.005249 | -0.020348 | -0.004979 | -0.004979 | -0.004499 | -0.005249 | -0.004979 | ... | -0.005249 | -0.004979 | -0.005249 | -0.007615 | -0.004364 | 1.000000 | -0.005249 | -0.005249 | -0.004268 | -0.004979 |
| PSG1 | -0.008333 | -0.009322 | 0.497375 | 0.497375 | -0.020348 | 0.630630 | 0.630630 | -0.004499 | 0.497375 | -0.004979 | ... | 0.497375 | 0.630630 | 0.497375 | -0.007615 | 0.134926 | -0.005249 | 1.000000 | -0.005249 | 0.112487 | 0.630630 |
| MIR3191 | -0.008333 | -0.009322 | -0.005249 | -0.005249 | 0.118817 | -0.004979 | -0.004979 | -0.004499 | -0.005249 | -0.004979 | ... | -0.005249 | -0.004979 | -0.005249 | -0.007615 | -0.004364 | -0.005249 | -0.005249 | 1.000000 | -0.004268 | -0.004979 |
| SEZ6L | -0.006775 | -0.007579 | 0.813013 | 0.112487 | -0.016543 | 0.586533 | 0.586533 | -0.003658 | 0.112487 | -0.004048 | ... | 0.112487 | 0.143597 | 0.112487 | -0.006191 | 0.999479 | -0.004268 | 0.112487 | -0.004268 | 1.000000 | 0.143597 |
| ADAMTS5 | -0.007903 | -0.008841 | 0.630630 | 0.630630 | -0.019299 | 0.799056 | 0.799056 | -0.004267 | 0.948434 | -0.004722 | ... | 0.630630 | 0.799056 | 0.948434 | 0.011096 | 0.172005 | -0.004979 | 0.630630 | -0.004979 | 0.143597 | 1.000000 |
29 rows × 29 columns
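As a sanity check on which of the duplicate genes are identical to one another, the columns can also be grouped directly by their exact expression vectors, instead of reading pairwise correlations off the matrix. A minimal sketch on a toy matrix (the gene names here are hypothetical):

```python
import pandas as pd

# Toy expression matrix: cells in rows, genes in columns (hypothetical names).
toy = pd.DataFrame({
    "GENE_A": [0, 1, 2],
    "GENE_B": [0, 1, 2],   # exact duplicate of GENE_A
    "GENE_C": [5, 0, 0],
})

# Group genes whose expression vectors are element-wise identical.
groups = {}
for gene in toy.columns:
    key = tuple(toy[gene])
    groups.setdefault(key, []).append(gene)

duplicate_groups = [g for g in groups.values() if len(g) > 1]
print(duplicate_groups)  # [['GENE_A', 'GENE_B']]
```

Unlike the correlation matrix, this groups only exact duplicates, which is what drop_duplicates() removes.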
data_noDup = data.T.drop_duplicates(inplace=False)
data_noDup.T
| WASH7P | MIR6859-1 | WASH9P | OR4F29 | MTND1P23 | MTND2P28 | MTCO1P12 | MTCO2P12 | MTATP8P1 | MTATP6P1 | ... | MT-TH | MT-TS2 | MT-TL2 | MT-ND5 | MT-ND6 | MT-TE | MT-CYB | MT-TT | MT-TP | MAFIP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam | 0 | 0 | 1 | 0 | 0 | 2 | 2 | 0 | 0 | 29 | ... | 0 | 0 | 0 | 505 | 147 | 4 | 270 | 0 | 5 | 8 |
| output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 12 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 76 | 0 | 0 | 0 |
| output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | ... | 1 | 0 | 0 | 44 | 8 | 0 | 66 | 0 | 1 | 0 |
| output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 68 | ... | 0 | 0 | 0 | 237 | 31 | 3 | 727 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 49 | ... | 0 | 0 | 1 | 341 | 46 | 1 | 570 | 0 | 0 | 0 |
| output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam | 0 | 0 | 1 | 0 | 0 | 2 | 5 | 5 | 0 | 370 | ... | 0 | 0 | 2 | 1612 | 215 | 6 | 3477 | 3 | 7 | 6 |
| output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam | 1 | 0 | 1 | 0 | 0 | 7 | 0 | 0 | 0 | 33 | ... | 0 | 0 | 0 | 62 | 20 | 0 | 349 | 0 | 2 | 0 |
| output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam | 0 | 0 | 4 | 1 | 0 | 29 | 4 | 0 | 0 | 228 | ... | 3 | 0 | 2 | 1934 | 575 | 7 | 2184 | 2 | 28 | 1 |
| output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam | 1 | 0 | 5 | 0 | 0 | 5 | 3 | 0 | 0 | 71 | ... | 5 | 2 | 3 | 1328 | 490 | 4 | 1149 | 2 | 11 | 4 |
383 rows × 22905 columns
data_noDup.T.shape
(383, 22905)
assert ((data.shape[1] - data_noDup.T.shape[1]) == duplicate_data.shape[0])
data = data_noDup.T
HCC1806 SmartSeq experiment: the number of duplicate genes is 54 over 23396. Therefore, after we remove the duplicates, the shape of the dataset will be (243, 23342).
To study the correlation between different cells, we compute the correlation matrix and plot a heatmap in order to visualize it.
plt.figure(figsize=(10,8))
c= data.T.corr() # it computes the correlation between the columns of data.T (the cells)
midpoint = (c.values.max() - c.values.min()) /2 + c.values.min() # calculates the average correlation value between the expression profiles of cells (find the maximum and minimum correlation values in the c matrix and computes the average of these two values)
sns.heatmap(c,cmap='coolwarm', center=0) # correlation matrix c as input and applies the colormap 'coolwarm'. The center=0 argument sets the midpoint of the colormap at zero, so positive and negative correlations are shown with different colors
print("Number of cells included: ", np.shape(c))
print("Average correlation of expression profiles between cells: ", midpoint)
print("Min. correlation of expression profiles between cells: ", c.values.min())
Number of cells included:  (383, 383)
Average correlation of expression profiles between cells:  0.49898217617448165
Min. correlation of expression profiles between cells:  -0.002035647651036618
Looking at the previous map, we can notice that some cells have very low correlation values. We now try to further investigate why this happens.
We first visualize the previous plot for a subset of the cells, in order to select two cells that have low correlation values with all the others and two that show high correlation values.
data_subset = data.iloc[:20, :]
plt.figure(figsize=(10,8))
c= data_subset.T.corr() # it computes the correlation between the columns of data.T (the cells)
midpoint = (c.values.max() - c.values.min()) /2 + c.values.min() # calculates the average correlation value between the expression profiles of cells (find the maximum and minimum correlation values in the c matrix and computes the average of these two values)
sns.heatmap(c,cmap='coolwarm', center=0) # correlation matrix c as input and applies the colormap 'coolwarm'. The center=0 argument sets the midpoint of the colormap at zero, so positive and negative correlations are shown with different colors
print("Number of cells included: ", np.shape(c))
Number of cells included: (20, 20)
data_subset_1 = data.iloc[30:60, :]
plt.figure(figsize=(10,8))
c= data_subset_1.T.corr() # it computes the correlation between the columns of data.T (the cells)
midpoint = (c.values.max() - c.values.min()) /2 + c.values.min() # calculates the average correlation value between the expression profiles of cells (find the maximum and minimum correlation values in the c matrix and computes the average of these two values)
sns.heatmap(c,cmap='coolwarm', center=0) # correlation matrix c as input and applies the colormap 'coolwarm'. The center=0 argument sets the midpoint of the colormap at zero, so positive and negative correlations are shown with different colors
print("Number of cells included: ", np.shape(c))
Number of cells included: (30, 30)
# Cells WITHOUT correlation
cell_1_nocorr = 'output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam'
cell_2_nocorr = 'output.STAR.1_D8_Hypo_S170_Aligned.sortedByCoord.out.bam'
# Cells WITH correlation
cell_3_corr = 'output.STAR.1_C4_Norm_S100_Aligned.sortedByCoord.out.bam'
cell_4_corr = 'output.STAR.4_B4_Norm_S70_Aligned.sortedByCoord.out.bam'
Let's try to visualize their gene expression through their violin plots:
sns.violinplot(x=data.loc[cell_1_nocorr])
plt.show()
sns.violinplot(x= data.loc[cell_2_nocorr])
plt.show()
sns.violinplot(x= data.loc[cell_3_corr])
plt.show()
sns.violinplot(x= data.loc[cell_4_corr])
plt.show()
row1_values = data.loc[cell_1_nocorr]
row2_values = data.loc[cell_2_nocorr]
row3_values = data.loc[cell_3_corr]
row4_values = data.loc[cell_4_corr]
# Create a new DataFrame from the selected rows
elem = pd.DataFrame({ 'Cell 1 WITHOUT correlation': row1_values, 'Cell 2 WITHOUT correlation': row2_values, 'Cell 3 WITH correlation': row3_values, 'Cell 4 WITH correlation': row4_values})
# Plot the violin plots side by side
plt.figure(figsize=(16,4))
sns.violinplot(data=elem)
plt.show()
From these plots we can deduce that the cells showing almost no correlation with the others are the ones that express very few genes. We will need to remove them later.
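The number of detected (nonzero) genes per cell gives a direct measure of this effect and a criterion for later filtering. A minimal sketch on a toy count matrix (cell and gene names are hypothetical, and the threshold is purely illustrative):

```python
import pandas as pd

# Toy count matrix: 3 cells x 4 genes (hypothetical names).
counts = pd.DataFrame(
    [[0, 0, 1, 0],    # a cell expressing almost nothing
     [10, 3, 0, 7],
     [5, 2, 8, 1]],
    index=["cell_low", "cell_a", "cell_b"],
    columns=["G1", "G2", "G3", "G4"],
)

# Number of detected (nonzero) genes per cell.
genes_detected = (counts > 0).sum(axis=1)

# Flag cells detecting fewer than 2 genes (illustrative threshold).
low_quality = genes_detected[genes_detected < 2].index.tolist()
print(low_quality)  # ['cell_low']
```

On the real data, `counts` would be the `data` dataframe and the threshold would be chosen from the distribution of detected genes.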
Let's identify the number of outliers and their percentage of the total.
To find the outliers, we compute for each column the 25th percentile Q1 (the value below which 25% of the data falls) and the 75th percentile Q3 (the value below which 75% of the data falls), and take the interquartile range IQR = Q3 - Q1, a measure of the spread of the middle 50% of the data that is commonly used to flag outliers (values more than 1.5 * IQR below Q1 or above Q3).
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
WASH7P 0.0
MIR6859-1 0.0
WASH9P 2.0
OR4F29 0.0
MTND1P23 0.0
...
MT-TE 7.0
MT-CYB 3842.5
MT-TT 3.0
MT-TP 8.0
MAFIP 2.0
Length: 22905, dtype: float64
IQR.value_counts()
0.0 10616
1.0 606
2.0 354
3.0 281
4.0 251
...
373.5 1
324.0 1
572.0 1
692.5 1
3842.5 1
Length: 992, dtype: int64
We can see that many genes have an interquartile range of 0.
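The share of zero-IQR genes can be quantified directly from the IQR series; a small sketch on toy data (the gene names are hypothetical):

```python
import pandas as pd

# Toy matrix: the first two genes are zero in more than 75% of cells,
# so their interquartile range is 0.
df = pd.DataFrame({
    "G1": [0, 0, 0, 0, 9],
    "G2": [0, 0, 0, 1, 0],
    "G3": [1, 2, 3, 4, 5],
})

iqr = df.quantile(0.75) - df.quantile(0.25)
frac_zero_iqr = (iqr == 0).mean()  # fraction of genes with IQR == 0
print(frac_zero_iqr)
```

On the real data this fraction is 10616 / 22905, i.e. nearly half the genes are zero in at least three quarters of the cells.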
data_noOut = data[~((data < (Q1 - 1.5 * IQR)) |(data > (Q3 + 1.5 * IQR))).any(axis=1)]
print(data_noOut.shape)
(4, 22905)
If we treat as outliers all values beyond 1.5 * IQR from the quartiles and drop every row (cell) that contains at least one such value, almost nothing survives: with this many genes, nearly every cell has some value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. We thus should proceed in another way.
HCC1806 SmartSeq experiment: we would obtain a resulting dataframe of dimensions (0, 23342), therefore an empty one.
We could try to compute the IQR of each row of the dataset: we transpose the dataset and proceed as above.
dataT = data.T
Q1 = dataT.quantile(0.25)
Q3 = dataT.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam 17.0
output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam 0.0
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam 5.0
output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam 0.0
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam 7.0
...
output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam 9.0
output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam 27.0
output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam 30.0
output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam 38.0
output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam 33.0
Length: 383, dtype: float64
IQR.value_counts()
0.0     38
33.0    16
35.0    15
34.0    14
28.0    12
31.0    12
2.0     12
17.0    11
32.0    11
25.0    10
18.0    10
29.0    10
42.0     9
14.0     9
27.0     9
11.0     8
39.0     8
1.0      8
9.0      8
38.0     8
30.0     7
45.0     7
21.0     7
23.0     7
37.0     7
13.0     7
19.0     7
26.0     7
3.0      6
15.0     6
8.0      6
5.0      6
22.0     5
36.0     5
7.0      5
20.0     5
4.0      5
6.0      5
10.0     4
40.0     4
43.0     4
41.0     4
16.0     4
44.0     4
12.0     3
24.0     3
47.0     2
46.0     1
50.0     1
49.0     1
dtype: int64
data_noOut_T = dataT[~((dataT < (Q1 - 1.5 * IQR)) |(dataT > (Q3 + 1.5 * IQR))).any(axis=1)]
data_noOut = data_noOut_T.T
print(data_noOut.shape)
(383, 6424)
print("Difference of number of columns:", data.shape[1]-data_noOut.shape[1])
Difference of number of columns: 16481
print("Percentage of removed columns:", (data.shape[1]-data_noOut.shape[1])/data.shape[1]*100, "%")
Percentage of removed columns: 71.9537218947828 %
Removing from the transposed dataset every row, i.e. every gene of the original dataset, that contains at least one value more than 1.5 * IQR below Q1 or above Q3 leaves a final dataframe with 6424 columns. We therefore removed 16481 genes, more than 70% of the total.
HCC1806 SmartSeq experiment: percentage of removed columns of 53.77431239825208 %, which is still very high.
It is important to notice that outliers must be treated very carefully here. Each observation is an RNA sequencing count, so a very high count should not be treated as an error, but rather as a potentially important signal. We will investigate in a later section whether removing outliers improves our results.
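A gentler alternative to removing high counts, not applied here, is a log transform, which compresses extreme values while preserving their ordering. A minimal sketch with NumPy:

```python
import numpy as np

# A few raw counts, including an extreme value like those seen in MT-CYB.
counts = np.array([0, 5, 270, 16026])

# log1p = log(1 + x): keeps zeros at zero and compresses large counts.
log_counts = np.log1p(counts)
print(log_counts.round(2))
```

After this transform the largest count is within an order of magnitude of the others, so no datapoint has to be discarded.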
Let's try to gain more information about the dataset and how to treat the outliers. We draw violin plots for a few randomly chosen individual cells.
rows = list(data.index)
random.seed(88)
ind1 = random.randint(0,243)  # note: samples only among the first 244 of the 383 cells
print(ind1)
sns.violinplot(x= data.loc[rows[ind1]])
plt.show()
101
ind2 = random.randint(0,243)
print(ind2)
sns.violinplot(x= data.loc[rows[ind2]])
plt.show()
48
We can notice that the two randomly chosen cells have very similar plots; however, their maxima are considerably different.
Now we draw a violin plot with all the cells. To better visualize it, we use as x-tick labels the cell name attribute of each row (saved in the metadata).
For 50 cells:
names = [i for i in data_meta["Cell name"]] # get the cell name attribute of each cell
data_small = data.T.iloc[:, :50]
names_small = names[:50] # select 50 cells
plt.figure(figsize=(16,4))
plot=sns.violinplot(data=data_small, palette="Set3", cut=0)
plot.set_xticklabels(names_small, rotation=90, fontsize=6)
plt.show()
Going back to the issue of outliers, let's plot the first 50 cells of the dataset without outliers:
data_noOut_small = data_noOut.T.iloc[:, :50]
names_small = names[:50]
plt.figure(figsize=(16,4))
plot=sns.violinplot(data=data_noOut_small, palette="Set3", cut=0)
plot.set_xticklabels(names_small, rotation=90, fontsize=6)
plt.show()
Let's visualize the plot of the previously randomly chosen single cells excluding outliers:
print(ind1)
sns.violinplot(x= data_noOut.loc[rows[ind1]])
plt.show()
101
print(ind2)
sns.violinplot(x= data_noOut.loc[rows[ind2]])
plt.show()
48
From the previous plots we can deduce that removing the outliers lowers the maximum values, but a large number of zeros remains.
We can deduce that the dataset is sparse. Let's analyze this concept in more detail.
HCC1806 SmartSeq experiment: similar results and same conclusion.
Sparsity means that the matrix contains many zero values.
We can try to quantify sparsity of the dataset, calculating the proportion of zero values in the gene expression matrix as:
sparsity = (number of zeros) / (total number of elements in the matrix)
Let's compute this sparsity index for the original dataset:
n_zeros = np.count_nonzero(data==0) # count the number of elements in the boolean mask (data == 0) that are true, so the number of 0 elements
print('Number of 0 values in the matrix:', n_zeros)
sp = n_zeros / data.size
print('Sparsity index:', sp*100, '%')
Number of 0 values in the matrix: 5278229
Sparsity index: 60.16711094696393 %
In the dataset without outliers, we obtain:
n_zeros_noout = np.count_nonzero(data_noOut==0) # count the number of True elements in the boolean mask (data_noOut == 0), i.e. the number of 0 entries
print('Number of 0 values in the matrix:', n_zeros_noout)
sp_noout = n_zeros_noout / data_noOut.size
print('Sparsity index:', sp_noout*100, '%')
Number of 0 values in the matrix: 2349283
Sparsity index: 95.48409359159028 %
We can see that removing outliers is not a good idea: the sparsity is even higher than before, since the filtering preferentially removes genes with high counts and keeps genes dominated by zeros.
HCC1806 SmartSeq experiment: the sparsity index of the original dataset is about 55.8 % and the one of the dataset without outliers is about 86.6 %, so the same conclusion holds.
Sparse data may cause several problems when training a machine learning model (over-fitting, lower model performance, etc.), so it should be handled properly.
Even for the original dataset, the sparsity index shows that more than half of the elements in the matrix are equal to 0, so the dataset is sparse.
Using a sparse matrix representation can be advantageous when the data is sparse: sparse matrices store only the non-zero values, which can lead to significant memory savings. However, memory is not our main concern here, so we can keep working with the dense representation.
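To illustrate the memory point, a minimal sketch (on a toy matrix, not our dataset) comparing dense NumPy storage with SciPy's CSR sparse format:

```python
import numpy as np
from scipy import sparse

# Toy expression-like matrix: roughly 90% zeros
rng = np.random.default_rng(42)
dense = rng.poisson(0.1, size=(1000, 1000)).astype(np.float64)

csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
# CSR stores only the non-zero values plus their column indices and row pointers
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"Dense:  {dense_bytes / 1e6:.1f} MB")
print(f"Sparse: {sparse_bytes / 1e6:.1f} MB")
```

At ~90% sparsity the CSR copy takes a fraction of the dense memory, while representing exactly the same matrix.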
There are several ways to address the sparsity problem when training a machine learning model on gene expression data.
For instance, we could employ dimensionality reduction techniques like Principal Component Analysis (PCA) to reduce the dimensionality of the data while preserving the most important features.
We employ PCA to address the sparsity problem in the next sections.
To examine the distribution of the dataset, we look at the skewness and kurtosis of the gene expression profiles.
Skewness measures the degree of asymmetry of a distribution: a distribution is skewed if it is not symmetric around its mean.
Kurtosis measures how peaked or flat a distribution is, i.e. how heavy its tails are.
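As a quick reference for these definitions, a small sketch on synthetic data (note that scipy's `kurtosis` returns excess kurtosis, which is 0 for a normal distribution):

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(0)
normal = rng.normal(size=100_000)             # symmetric, light tails
right_skewed = rng.exponential(size=100_000)  # long right tail

# Normal data: skewness ~ 0, excess kurtosis ~ 0
print(skew(normal), kurtosis(normal))
# Exponential data: skewness ~ 2, excess kurtosis ~ 6
print(skew(right_skewed), kurtosis(right_skewed))
```

Large positive values of both statistics, as we obtain below, therefore indicate right-skewed, heavy-tailed distributions.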
We will use the scipy.stats module to calculate the skewness and kurtosis of each column of data.T, i.e. of each row (cell) of the dataset:
from scipy.stats import kurtosis, skew
cnames = list(data.T.columns)
colN = np.shape(data.T)[1]
data_skew_cells = []
for i in range(colN):
    name = data.T[cnames[i]]
    data_skew_cells += [skew(name)]
sns.histplot(data_skew_cells,bins=100)
plt.xlabel('Skewness of single cells expression profiles - original df')
plt.show()
print( "Skewness of data: ", data_skew_cells)
print("Mean of skewness values:", np.mean(data_skew_cells))
Skewness of data: [65.3293476728411, 38.73257818368301, 48.14055338427522, 25.51111003985754, 61.807162316617756, 67.04123335084539, 36.590323800613746, 71.0267787850064, 46.968455285544245, 50.700986376200646, 62.01708428399256, 48.47588656194911, 43.95274295898324, 44.84104723244163, 45.212925257275465, 78.61404436439186, 58.977735156829084, 21.916000765107164, 73.77516658973012, 59.01333707783532, 66.18273458292978, 58.92877893459357, 44.94185124605908, 103.80162853428926, 57.85913666503987, 52.61678597578867, 79.95809503472208, 29.797032951121288, 75.83878274045708, 62.732518968901765, 58.931309180084575, 56.32305724868327, 57.02376162042931, 57.37104819661485, 35.84361391791805, 69.65974750539775, 51.515414034789075, 50.39866420239922, 38.03240678913514, 63.80300193632892, 55.837467193115366, 50.10913881415591, 65.46704067028153, 35.08184883135755, 39.332180695995035, 57.596025526774575, 82.7008853973779, 61.19103247245235, 52.774568256734895, 68.02944491194704, 38.53118067620604, 28.588176975726686, 57.00620049315085, 63.64230198109848, 79.5814433438006, 69.39567916304566, 53.10027574801472, 49.16127203646519, 47.0829797328888, 62.772508037507386, 19.35792192037944, 62.748039864834105, 52.968565526844934, 68.34729685346215, 58.81258441759759, 49.82857658687345, 59.87512984436019, 58.25736378796296, 28.484880862403536, 56.254329732040226, 56.07399539571472, 80.76084961070782, 52.46314466682449, 47.56013553643926, 41.93952365975386, 48.41525226850443, 66.45470740184908, 23.430231658046488, 52.59660511284909, 26.996445100615517, 88.72384645247065, 57.73638014488098, 51.67815606794171, 49.99290619058036, 58.91165399779514, 57.62940065991728, 86.87385559376973, 151.33406769019496, 63.43405396480639, 67.84355562871343, 51.23046612192652, 32.22013850139017, 69.35951824561127, 46.783572043105494, 49.62970467475006, 61.30010672793698, 56.55120164973365, 47.72257754502327, 41.13858434684936, 70.61539961968228, 48.457492919999616, 64.81169245584256, 64.18147361625405, 
61.18360142189198, 71.55326665580401, 40.97540051997736, 37.090630098898394, 65.33448071876276, 39.503515067138935, 48.45693266453451, 44.87292710996915, 57.05418158044058, 45.43465346616777, 77.41202888313158, 63.76901181667419, 34.36351904649953, 55.38381935661642, 48.33926740746969, 92.23731195460935, 69.28370745364008, 57.81637890874989, 55.96700084744578, 40.92502099919537, 67.07014242268563, 58.144113642609085, 74.99811120807789, 75.51163013852279, 67.48424289731418, 62.19081936604902, 55.21105149409568, 52.08741382622933, 65.43988375973596, 63.244760618906305, 65.92890442681333, 43.070529413597185, 62.75642336224687, 49.88366068684939, 81.94118870168232, 78.77161002973799, 55.00290504417747, 73.07432615063674, 55.23996981794865, 34.18237073016225, 55.281076882684786, 48.91830707689147, 43.60985448206934, 49.74136703691775, 86.82210526917677, 50.20629767794813, 45.28184443992491, 71.12107067316758, 49.150796466940506, 33.68738010638049, 57.39347522014365, 45.52613634560229, 58.542644230781406, 54.92728433790341, 41.642345566087705, 49.89789130011741, 63.18998861435136, 52.03625521757694, 37.729818290089334, 27.53423453272267, 62.03878036762236, 55.85119214512102, 53.044524771038354, 50.42496058788768, 45.99718633306048, 45.625165598140384, 47.573358236968716, 48.76164952526791, 70.2482386647605, 65.79841483303738, 41.997775287411756, 60.68684882596332, 70.86052280096243, 67.69556128113159, 53.045714746235525, 62.544549883870815, 35.235716374917104, 49.48740432116383, 71.58145983827528, 53.52661726255076, 43.99946345530929, 52.13854786125857, 47.97016856600509, 71.37906051323417, 38.76801202337355, 52.51851262701272, 101.94156548640015, 74.62936306206454, 41.56100743024018, 41.04854658446184, 69.19704343287377, 55.482292610340934, 47.25928489104872, 50.59742956337423, 67.04990393240291, 62.98667381046708, 55.77599367565253, 45.18407378137543, 53.868376733327196, 47.98521307581243, 58.886993180664476, 44.406854835903296, 46.66735387869767, 54.421230993529896, 
64.99077419616977, 50.809752914776915, 53.305455105735504, 42.44164975643002, 76.28616849400905, 30.513831906406917, 52.072350578753515, 46.57930393283949, 66.12570371672652, 90.0120279458302, 51.80056717707419, 54.02760799197032, 80.37132968273718, 56.188572523463975, 33.170347102801685, 83.02188343365239, 40.24635237895199, 49.608195703718195, 53.81382356664251, 66.95755855926298, 51.577004656549796, 36.6315107238475, 59.712343377393665, 59.81952216900437, 38.9174653994669, 37.77615469598503, 25.589032769036518, 72.3552385117538, 58.03440017648565, 63.15715812028286, 49.7693185111637, 50.03068560842177, 58.85206988408884, 77.2383483553717, 82.21980739654236, 56.51803116621448, 47.70105460048302, 69.77039875138436, 113.13264231357152, 65.14209478323767, 38.79209286159597, 55.60092901557804, 33.587930551375, 42.69652622843581, 49.14254323710053, 43.46646574962866, 39.627744544153444, 43.335485500145026, 49.45431856299877, 42.08273168383305, 51.44922158945097, 63.142832024553655, 71.91338373615913, 71.95698020141732, 47.885089295025054, 68.89190992918387, 58.19396494713139, 72.20759519014264, 51.48779179277785, 44.39415641321146, 37.31009703960305, 53.37008873321303, 128.2837228531314, 40.57255244500706, 57.309762568619874, 61.176446393595256, 50.70615786100201, 40.81259361962592, 44.90484418596253, 54.90719933156673, 50.774340907341504, 52.06327035931143, 71.81452996797563, 35.97333358798076, 44.55485680209889, 68.52589829388387, 51.430597409341594, 50.54153766138653, 42.7743100791412, 80.49210280803489, 40.03164103558266, 51.302607041675266, 66.030767672662, 30.23828821108307, 46.97286764265703, 63.27425599547708, 52.41015628352887, 45.216424084882746, 9.855202939473623, 44.76084600263007, 58.55135643548786, 47.96201262500501, 66.63060944371539, 45.21249358312084, 57.92877525822537, 52.14043112806137, 52.479675277476204, 34.85791433254715, 60.27744583614408, 50.6893633089063, 35.71475156968971, 38.816912906101436, 41.59598243398502, 65.01595386209534, 
66.26284364942425, 41.42238430550905, 73.03452633982722, 51.63442688841827, 49.48355176127539, 54.25555823748036, 53.22105924461813, 61.44495881139293, 57.166805028938576, 58.984914742852354, 48.41275353475184, 40.34489211874943, 48.85963753249201, 79.44203610898319, 48.26989126635675, 47.61550962075728, 68.78482209870575, 60.01468411323561, 61.734478070399, 80.54560765683946, 67.82918177987547, 65.19070287784007, 42.650536294401874, 58.21857916068947, 43.91550533086219, 50.70822991047499, 45.8530948867382, 58.9007258602306, 42.10600918245145, 76.63725182212742, 39.8324226289307, 77.11833738811501, 49.89117093047586, 43.613747210802146, 58.482387972493, 46.27310124868976, 49.94790905529416, 60.9853413876058, 70.78177899044664, 47.00634883327633, 76.85923290456891, 53.72367495857801, 46.739669205914986, 76.34138532027994, 60.21598500266795, 64.04104764679742, 41.56208073514063, 41.5842396485268, 68.11466755330545, 68.00477209879008, 51.334030790956184, 54.091429856453026, 28.81039550302069, 48.742414132421835, 58.83617413941023, 50.77734148088826, 72.56432965207068, 53.926899678862064, 57.21605880930854, 47.70302257650459, 49.33473346404864, 54.20737574625473, 55.85075876727569, 54.45153322982945, 84.38903329203136, 73.83782191673744, 48.55240272287677, 74.35852310510634, 45.46362806672874, 42.05736916684996, 47.96387486528263, 55.998763824447785] Mean of skewness values: 55.56539160251531
data_kurt_cells = []
for i in range(colN):
    name = data.T[cnames[i]]
    data_kurt_cells += [kurtosis(name)]
sns.histplot(data_kurt_cells, bins=100)
plt.xlabel('Kurtosis of single cells expression profiles - original df')
plt.show()
print( "Excess kurtosis of data distribution: ", data_kurt_cells)
print("Mean of kurtosis values:", np.mean(data_kurt_cells))
Excess kurtosis of data distribution: [5463.645282643022, 1995.8520052733418, 2901.798051720723, 917.7893553704595, 4656.550578232189, 5580.3195682340875, 2010.9651813965895, 6219.628279280727, 2990.67999138626, 2940.5931578284826, 4592.719301484182, 2627.0498838917624, 2217.448595585073, 2356.356087133049, 2473.511712027504, 7973.608919171623, 4313.571814924348, 799.5354281945483, 6299.13993323815, 4649.269380039665, 5777.840053343645, 3807.887574395473, 2455.9004110862716, 11869.602376301447, 4005.2681270595376, 3504.796075053494, 8055.60069416592, 1308.7672887695835, 6931.3347328174195, 5124.5504128334205, 4615.4316484638475, 4124.508712212333, 4177.310501731559, 3997.8462925983804, 1765.1957149921402, 6856.335494083642, 3107.9509626406248, 2864.8863034575156, 1914.8864839697094, 4998.6709103247695, 4158.328482025339, 3254.8422508530657, 5205.227522488022, 1888.1523758896278, 2312.149738454117, 4020.478021860082, 7434.874527501197, 4223.805454645823, 3561.3577646370686, 5434.304721270646, 1864.0934617869314, 1390.4780672023965, 4295.401594094751, 5224.965021123387, 7280.033240705879, 5734.1372804418, 3710.1110312866977, 2775.454269657457, 2663.638557831071, 4665.859690356973, 497.3704841811498, 5107.989280090836, 3389.8835114473163, 5504.961530308903, 5057.028578365861, 3132.5918695287983, 4940.824456901505, 4312.615332582832, 1233.9664631191279, 3619.1998750662406, 3812.8676147101846, 8515.664727014871, 3478.932404791178, 2856.325039229736, 1756.9236448070544, 2977.711481441854, 5193.040337855706, 807.8196458251136, 3669.64560322169, 1075.978660770195, 8784.423618701825, 3850.0976971364717, 3319.8071727103033, 2869.845189461686, 3984.67665982721, 3981.2591318765144, 8845.231023160475, 22900.00004366052, 4881.9806268433085, 5620.04733495222, 3219.0773902416886, 1036.1373250487657, 6019.742985602328, 2670.1115981833523, 3092.339365179215, 4119.185789468242, 3979.610995282692, 2843.5353900207965, 2170.343327619934, 6236.434101369996, 3227.57035551919, 
5320.835337685486, 5275.1235674631525, 4866.735297166459, 6305.186953678495, 2212.522949399944, 1585.466191225797, 5234.486170847774, 2246.4056763200015, 3098.113817321201, 2480.162192703061, 4094.9748013092767, 3001.7239748848842, 7182.260410138484, 4852.6121214290915, 1811.3474297796356, 3939.109919649032, 2807.3080830274907, 11283.313915736582, 5510.434643871869, 4215.075016747391, 3665.15195579173, 2047.9436793516334, 5487.172076977874, 4420.16443985175, 6900.608390335716, 7043.742036376604, 5464.644610235604, 4952.972930326694, 3424.9810968987895, 3232.616273767296, 5871.633613415097, 4952.237054706167, 5398.904423336076, 2275.5894790688576, 4676.676858305407, 3381.4266934198126, 7839.865173239221, 7409.061118971837, 4082.847786735905, 6618.752084940854, 3583.7467263839626, 1510.3684965275293, 3382.113611231845, 2761.4886480486775, 2444.6872967524328, 3043.4665807225338, 8757.903968972863, 3281.526384659717, 2776.8066052474683, 6230.629943506198, 3373.0046736829477, 1524.771526262693, 3947.4485575372146, 2393.9348139393596, 4333.457032938904, 3526.990594039238, 2270.048684237479, 3377.9473640612136, 5046.44948980189, 3999.8150973761417, 2235.670451971863, 1158.0088627838036, 4959.182696132661, 4195.154566570198, 3311.102308851241, 3040.861240928802, 2493.8408983628074, 2470.8587738742335, 2936.573680682571, 2968.044917423188, 5808.789355497862, 5784.099159259742, 2252.611234807281, 4713.521174620785, 6187.608011680268, 5807.185987206108, 3202.4927997677937, 5169.264707114527, 1658.0507614192088, 3042.585784982708, 6828.8078641901875, 3503.783192788276, 2571.728916989993, 3705.616387860165, 3149.3603078388537, 6364.922732519532, 2199.1406074975916, 3733.1688525442114, 12665.602610634798, 6877.947209967625, 1969.08962536927, 2099.196486462552, 5467.745158979508, 3884.8483027660527, 3047.7267644081076, 3686.508353186244, 5636.403354497779, 5202.090630855094, 4289.575310202222, 2791.260121155426, 3522.752177209856, 2960.6496286759307, 4602.372853072517, 
2510.7647908630825, 2641.336995906534, 4622.023844898401, 5544.318793682863, 3418.495740090052, 3808.211213525726, 2585.429093686284, 7235.300601151805, 1433.0507894849848, 3177.4587411738917, 2639.277519366078, 5266.742715892161, 10829.915766777687, 3111.5661584688464, 3565.381672916133, 7691.428110840698, 4301.881939298061, 1579.5749354351344, 8176.498578323929, 2204.0920803263252, 3722.00256638277, 3912.4511812299693, 5813.981681928593, 3083.859114031102, 1634.9183055023975, 4826.395805990639, 4148.19349345752, 2209.439682405443, 2057.900958062454, 1226.950231677343, 6868.6150172985535, 4437.7595206463575, 5017.62059882919, 3096.7856276012517, 3060.7965659316633, 4324.943398671239, 7624.3532778644785, 8552.570701124041, 3832.6558015026358, 3153.0582062575513, 5646.099501366646, 13276.122074577097, 5582.6352136103005, 2166.0767064776555, 4454.425635371692, 1267.8347294402763, 2567.9628129062094, 2841.4647373060716, 2555.7868484867595, 2559.702804053541, 2615.9066264192566, 3454.0070101244864, 2483.3459049894723, 3760.532015909568, 5372.9924158857375, 6434.874658532618, 6475.1623440573885, 3111.269556608694, 5249.889589964156, 3805.9986799188387, 6483.604096979191, 3524.854631464673, 2596.4323077209438, 1928.320769065881, 3888.2159095739603, 18033.18723528011, 2451.571721549558, 4453.079910528993, 5022.339388732607, 3109.4227860675746, 2268.4469130644125, 2445.497151123613, 3683.08122308099, 3281.6649032323644, 3104.1240308229526, 6460.76536639039, 1887.2374587547718, 2912.2266645565724, 5727.829322688068, 3807.438368846949, 3661.451878879706, 2382.8620687868647, 8407.205840987885, 2090.2859983839708, 3077.253704704526, 5008.726785428288, 1499.0663881288908, 3066.734606425461, 5342.365536475756, 3394.398233497225, 2983.0185761941857, 283.1550378543401, 2695.769805387075, 4301.7487955973465, 2880.8390346344113, 5687.7588845587525, 2430.7594026852876, 4041.614672345835, 3616.7742205148643, 3726.453115479002, 2000.5758048552589, 4587.663207914851, 3362.8008311265776, 
1666.5338591745058, 2660.631833343237, 2140.2321148630776, 5096.357803926749, 5285.0659204306885, 2357.6836766311308, 6017.780050030185, 3215.3082042881283, 3377.452160688309, 4153.817287956358, 3811.7656345534365, 4672.145059906095, 4357.2879598896, 4506.651606422365, 2790.0937475955275, 2115.228817227222, 3072.839175454844, 8114.28833649313, 2825.456495608763, 3036.9303656126885, 5929.021134961383, 4514.838343882486, 4685.68924298893, 7744.992956829757, 5699.724607813706, 5353.3098299271605, 2109.405616056584, 4039.6681312856795, 2370.9889856645163, 3280.220042374156, 2647.3574138418294, 4323.506180015513, 2456.360483159077, 7202.080574549229, 2734.136619375073, 7489.643724779266, 3389.9974419281616, 2828.0133722315254, 4061.589786744237, 2529.667822057342, 3046.709483417442, 5096.057949924222, 6150.699147612293, 2771.816117936815, 7337.0778745596435, 3919.831279993978, 2996.945576741513, 7089.5998630679, 4747.25793577658, 5160.649549868427, 2113.561043559752, 2109.5567324398226, 5545.16698044071, 5769.91865198394, 3348.735308657892, 3315.735414729145, 1358.7387235579256, 3456.900481652312, 4262.848140385975, 3542.684219422403, 6592.636916719777, 3558.273090284296, 4356.978019160817, 2686.53262356749, 3130.94952360271, 3588.982952545438, 4014.305615747471, 4273.700456117557, 9408.40765933973, 6682.382097057808, 3294.305611586939, 6576.165025289555, 2776.1153291375444, 2397.0787690606717, 2870.3598178541934, 3791.121439579466] Mean of kurtosis values: 4159.683961614613
HCC1806 SmartSeq experiment: we obtain mean of skewness values = 36.711205637200365 and mean of kurtosis values = 2390.818798228346.
From these graphs we can deduce that the distributions are highly non-normal. Indeed, the large positive kurtosis values indicate distributions far more peaked than a normal one, and the large positive skewness values show that they are right-skewed.
In general, deviating from a Gaussian distribution is acceptable, since not all methods require normality and this can be addressed during the analysis. Nevertheless, it is better to reduce skewness, as highly skewed data can be challenging to manage.
Data transformation is one way to deal with these problems. A common choice for bringing highly skewed data closer to a normal distribution is a log base 2 transformation, log2(x + 1), where the +1 avoids taking the logarithm of zero counts.
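As a sanity check on synthetic count-like data (not our dataset), the log2(x + 1) transform drastically reduces skewness:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
# Heavy-tailed, count-like toy data (lognormal rounded to integers)
counts = np.round(rng.lognormal(mean=2.0, sigma=2.0, size=50_000))

log_counts = np.log2(counts + 1)  # +1 keeps zero counts at zero

print("skewness before:", skew(counts))
print("skewness after: ", skew(log_counts))
```

The same transform is applied to the real expression matrix in the next cell.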
data_log2 = np.log2(data+1)
data_log2
| WASH7P | MIR6859-1 | WASH9P | OR4F29 | MTND1P23 | MTND2P28 | MTCO1P12 | MTCO2P12 | MTATP8P1 | MTATP6P1 | ... | MT-TH | MT-TS2 | MT-TL2 | MT-ND5 | MT-ND6 | MT-TE | MT-CYB | MT-TT | MT-TP | MAFIP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 1.584963 | 1.584963 | 0.000000 | 0.0 | 4.906891 | ... | 0.000000 | 0.000000 | 0.000000 | 8.982994 | 7.209453 | 2.321928 | 8.082149 | 0.000000 | 2.584963 | 3.169925 |
| output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 1.000000 | 1.000000 | 1.000000 | 0.0 | 3.700440 | ... | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 6.266787 | 0.000000 | 0.000000 | 0.000000 |
| output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 3.000000 | ... | 1.000000 | 0.000000 | 0.000000 | 5.491853 | 3.169925 | 0.000000 | 6.066089 | 0.000000 | 1.000000 | 0.000000 |
| output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 6.108524 | ... | 0.000000 | 0.000000 | 0.000000 | 7.894818 | 5.000000 | 2.000000 | 9.507795 | 0.000000 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 0.0 | 5.643856 | ... | 0.000000 | 0.000000 | 1.000000 | 8.417853 | 5.554589 | 1.000000 | 9.157347 | 0.000000 | 0.000000 | 0.000000 |
| output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 1.584963 | 2.584963 | 2.584963 | 0.0 | 8.535275 | ... | 0.000000 | 0.000000 | 1.584963 | 10.655531 | 7.754888 | 2.807355 | 11.764042 | 2.000000 | 3.000000 | 2.807355 |
| output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam | 1.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 3.000000 | 0.000000 | 0.000000 | 0.0 | 5.087463 | ... | 0.000000 | 0.000000 | 0.000000 | 5.977280 | 4.392317 | 0.000000 | 8.451211 | 0.000000 | 1.584963 | 0.000000 |
| output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam | 0.0 | 0.0 | 2.321928 | 1.0 | 0.0 | 4.906891 | 2.321928 | 0.000000 | 0.0 | 7.839204 | ... | 2.000000 | 0.000000 | 1.584963 | 10.918118 | 9.169925 | 3.000000 | 11.093418 | 1.584963 | 4.857981 | 1.000000 |
| output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam | 1.0 | 0.0 | 2.584963 | 0.0 | 0.0 | 2.584963 | 2.000000 | 0.000000 | 0.0 | 6.169925 | ... | 2.584963 | 1.584963 | 2.000000 | 10.376125 | 8.939579 | 2.321928 | 10.167418 | 1.584963 | 3.584963 | 2.321928 |
383 rows × 22905 columns
We visualize violin plots using the same indices previously randomly selected.
print(ind1)
sns.violinplot(x=data_log2.loc[rows[ind1]])
101
<Axes: xlabel='output.STAR.2_A3_Norm_S9_Aligned.sortedByCoord.out.bam'>
print(ind2)
sns.violinplot(x=data_log2.loc[rows[ind2]])
48
<Axes: xlabel='output.STAR.1_E10_Hypo_S220_Aligned.sortedByCoord.out.bam'>
data_log2.T.describe()
| output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam | output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam | output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam | output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam | output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam | output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam | output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam | output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam | output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam | output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam | ... | output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam | output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam | output.STAR.4_H2_Norm_S356_Aligned.sortedByCoord.out.bam | output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam | output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam | output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam | output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam | output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam | output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam | output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 22905.000000 | 22905.000000 | 22905.000000 | 22905.000000 | 22905.000000 | 22905.000000 | 22905.000000 | 22905.000000 | 22905.000000 | 22905.000000 | ... | 22905.000000 | 22905.000000 | 22905.000000 | 22905.000000 | 22905.000000 | 22905.000000 | 22905.000000 | 22905.000000 | 22905.000000 | 22905.000000 |
| mean | 1.892372 | 0.009677 | 1.734012 | 0.409288 | 1.565756 | 2.177625 | 2.542539 | 2.603964 | 2.505422 | 0.628043 | ... | 1.661318 | 2.374226 | 0.512457 | 1.974534 | 1.746693 | 1.626697 | 2.147861 | 2.223999 | 2.371146 | 2.301653 |
| std | 2.744578 | 0.115966 | 3.062152 | 0.933189 | 2.159384 | 2.937413 | 3.167468 | 3.027512 | 3.108120 | 1.184831 | ... | 2.207995 | 2.864850 | 1.881161 | 2.657156 | 2.429107 | 2.242322 | 2.940746 | 2.999271 | 3.099276 | 2.993962 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 4.169925 | 0.000000 | 2.584963 | 0.000000 | 3.000000 | 4.584963 | 5.321928 | 5.169925 | 5.285402 | 1.000000 | ... | 3.321928 | 4.954196 | 0.000000 | 4.169925 | 3.700440 | 3.321928 | 4.807355 | 4.954196 | 5.285402 | 5.087463 |
| max | 15.512524 | 3.906891 | 16.324181 | 8.179909 | 13.369461 | 15.515977 | 14.850138 | 15.637446 | 15.145176 | 10.738092 | ... | 14.119671 | 14.511506 | 16.322509 | 14.850431 | 13.568669 | 14.235266 | 14.774272 | 15.313060 | 15.497540 | 16.069932 |
8 rows × 383 columns
The plot for 50 cells is:
data_small_log2 = data_log2.T.iloc[:, :50]
names_small = names[:50]
plt.figure(figsize=(16,4))
plot=sns.violinplot(data=data_small_log2, palette="Set3", cut=0)
plot.set_xticklabels(names_small, rotation=90, fontsize=6)
plt.show()
Let's visualize skewness and kurtosis of the transformed data:
cnames = list(data_log2.T.columns)
colN = np.shape(data_log2.T)[1]
colN
data_log_skew_cells = []
for i in range(colN):
    name = data_log2.T[cnames[i]]
    data_log_skew_cells += [skew(name)]
sns.histplot(data_log_skew_cells, bins=100)
plt.xlabel('Skewness of single cells expression profiles - log based 2 df')
plt.show()
print( "Skewness of log base 2 df: ", data_log_skew_cells)
print("Mean skewness:", np.mean(data_log_skew_cells))
Skewness of log base 2 df: [1.1057395234854044, 15.61679195544179, 1.4670439553197712, 2.675818577529791, 1.204185389342206, 1.0040053557524433, 0.807298608546033, ..., 0.8482511961518842, 0.8585750546288899] (output truncated) Mean skewness: 2.482476642311377
data_kurt_cells = []
for i in range(colN):
    cell = data_log2.T[cnames[i]]
    data_kurt_cells.append(kurtosis(cell))
sns.histplot(data_kurt_cells, bins=100)
plt.xlabel('Kurtosis of single-cell expression profiles - log2 df')
plt.show()
print("Excess kurtosis of log2 distribution:", data_kurt_cells)
print("Mean kurtosis:", np.mean(data_kurt_cells))
Excess kurtosis of log2 distribution: [-0.10313617588687274, 321.7228980621053, 0.699056587846735, 8.099655025021805, 0.46953390252257465, -0.2990000642606714, -0.7410539783404424, ..., -0.7299227326652322, -0.6611764440126167] (output truncated) Mean kurtosis: 103.02626328733729
HCC1806 SmartSeq experiment: after applying a log2 transformation, we obtain mean skewness = 1.940801299270791 and mean kurtosis = 59.442962646309546.
After the log transformation the dataset still does not follow a normal distribution, but the resulting skewness and kurtosis values are lower than those of the original dataset: by the reasoning above, rescaling is therefore a good idea.
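The effect described above can be reproduced on synthetic data (a minimal sketch; the lognormal "counts" below are illustrative, not our expression matrix): log2(x + 1) compresses the right tail of a skewed count distribution, so skewness drops.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Heavy-tailed synthetic "counts", similar in spirit to raw read counts
counts = rng.lognormal(mean=2.0, sigma=1.5, size=5000)

raw_skew = skew(counts)
log_skew = skew(np.log2(counts + 1))

# The log2 transform compresses the right tail, so skewness drops sharply
print(f"raw skewness:  {raw_skew:.2f}")
print(f"log2 skewness: {log_skew:.2f}")
```

The same comparison with `scipy.stats.kurtosis` shows an analogous reduction in excess kurtosis.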
data = data_log2
As our previous analysis shows, we have to filter out cells that show low activity, i.e. low gene read counts.
row_sum = data.sum(axis=1)
counter = 0
for x in row_sum:
    if x == 0:
        counter += 1
print(counter)
0
There are no cells with zero expression across every gene, but, as we have seen before, some cells have very low read counts; these are anomalous and should be removed.
To understand which cells are anomalous, we decide to make a plot representing the total counts of genes vs the number of expressed ones for each cell.
# Step 1: Calculate total counts of genes for each cell (sum of all elements in each row of the matrix)
total_gene_counts = data.sum(axis=1)
# Step 2: Calculate number of expressed genes for each cell (count of non-zero elements in each row of the matrix)
expressed_genes = (data != 0).sum(axis=1)
# Step 3: Create a scatter plot
plt.scatter(total_gene_counts, expressed_genes)
plt.xlabel('Total Counts of Genes')
plt.ylabel('Number of Expressed Genes')
plt.title('Gene Expression Scatter Plot')
plt.axvline(x=30000, color='salmon', linestyle='--')
plt.axvline(x=63000, color='salmon', linestyle='--')
plt.axhline(y=5000, color='salmon', linestyle='--')
plt.show()
From this plot, we define as 'outlier' cells those with total gene counts outside the range [30000, 63000] or number of expressed genes outside [5000, 14000] (dashed lines).
We chose these bounds heuristically, simply by looking at the plot, so they may not be completely accurate.
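A less heuristic alternative would be to derive the cutoffs from quantiles of the per-cell statistics instead of reading them off the plot. A minimal sketch (the `qc_bounds` helper and the toy totals are hypothetical, standing in for `data.sum(axis=1)`):

```python
import numpy as np
import pandas as pd

def qc_bounds(values, lower_q=0.05, upper_q=0.95):
    """Data-driven QC cutoffs: quantiles of a per-cell statistic."""
    return values.quantile(lower_q), values.quantile(upper_q)

# Toy per-cell totals standing in for data.sum(axis=1)
totals = pd.Series(np.random.default_rng(1).normal(45000, 8000, 300))
lo, hi = qc_bounds(totals)
kept = totals[(totals >= lo) & (totals <= hi)]
print(f"kept {len(kept)} of {len(totals)} cells")
```

With 5%/95% quantiles this keeps roughly the central 90% of cells; the quantile levels would still need to be chosen with the biology in mind.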
To remove these cells we have identified, we create a copy of data so that we do not modify the original dataframe while working on it.
data_copy = data.copy()
original_indices = data_copy.index
# Calculate total counts of genes for each cell
total_gene_counts = data_copy.sum(axis=1)
# Calculate number of expressed genes for each cell
expressed_genes = (data_copy != 0).sum(axis=1)
total_counts_range = (30000, 63000)
expressed_genes_range = (5000, 14000)
data_copy['total_counts'] = total_gene_counts
data_copy['expressed_genes'] = expressed_genes
# Filter the dataset to select only the rows within the specified range
data_filtered = data_copy.loc[
(data_copy['total_counts'] >= total_counts_range[0]) & (data_copy['total_counts'] <= total_counts_range[1]) &
(data_copy['expressed_genes'] >= expressed_genes_range[0]) & (data_copy['expressed_genes'] <= expressed_genes_range[1])
].reset_index(drop=True)
original_indices = original_indices[(data_copy['total_counts'] >= total_counts_range[0]) & (data_copy['total_counts'] <= total_counts_range[1]) &
(data_copy['expressed_genes'] >= expressed_genes_range[0]) & (data_copy['expressed_genes'] <= expressed_genes_range[1])
]
data_filtered = data_filtered.drop(columns=['total_counts', 'expressed_genes'])  # remove the helper columns
data_filtered.index = original_indices
We visualize again the plot and do some other checks to verify that we correctly removed the cells we defined as 'outliers'.
# Step 1: Calculate total counts of genes for each cell
total_gene_counts = data_filtered.sum(axis=1)
# Step 2: Calculate number of expressed genes for each cell
expressed_genes = (data_filtered != 0).sum(axis=1)
# Step 3: Create a scatter plot
plt.scatter(total_gene_counts, expressed_genes)
plt.xlabel('Total Counts of Genes')
plt.ylabel('Number of Expressed Genes')
plt.title('Gene Expression Scatter Plot')
plt.axvline(x=30000, color='salmon', linestyle='--')
plt.axvline(x=63000, color='salmon', linestyle='--')
plt.axhline(y=5000, color='salmon', linestyle='--')
plt.show()
total_gene_counts_filtered = data_filtered.sum(axis=1)
expressed_genes_filtered = (data_filtered != 0).sum(axis=1)
for x in total_gene_counts_filtered:
    assert 30000 <= x <= 63000
for x in expressed_genes_filtered:
    assert x >= 5000
# we assert that all the values are in the correct range, so we have removed the outliers
data_filtered.head()
| WASH7P | MIR6859-1 | WASH9P | OR4F29 | MTND1P23 | MTND2P28 | MTCO1P12 | MTCO2P12 | MTATP8P1 | MTATP6P1 | ... | MT-TH | MT-TS2 | MT-TL2 | MT-ND5 | MT-ND6 | MT-TE | MT-CYB | MT-TT | MT-TP | MAFIP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 1.584963 | 1.584963 | 0.0 | 0.0 | 4.906891 | ... | 0.0 | 0.0 | 0.0 | 8.982994 | 7.209453 | 2.321928 | 8.082149 | 0.0 | 2.584963 | 3.169925 |
| output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 1.000000 | 1.000000 | 1.0 | 0.0 | 3.700440 | ... | 0.0 | 0.0 | 0.0 | 1.000000 | 0.000000 | 0.000000 | 6.266787 | 0.0 | 0.000000 | 0.000000 |
| output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 6.108524 | ... | 0.0 | 0.0 | 0.0 | 7.894818 | 5.000000 | 2.000000 | 9.507795 | 0.0 | 0.000000 | 0.000000 |
| output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 1.000000 | 2.000000 | 0.0 | 0.0 | 7.672425 | ... | 1.0 | 0.0 | 0.0 | 9.805744 | 6.832890 | 2.000000 | 11.408330 | 1.0 | 1.000000 | 0.000000 |
| output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam | 0.0 | 0.0 | 3.459432 | 0.0 | 0.0 | 2.000000 | 3.459432 | 1.0 | 0.0 | 9.434628 | ... | 1.0 | 0.0 | 2.0 | 10.321928 | 7.044394 | 0.000000 | 13.187197 | 1.0 | 1.000000 | 0.000000 |
5 rows × 22905 columns
data = data_filtered
HCC1806 experiment: we apply the same procedure, with total_counts_range = (41000, 80000) and expressed_genes_range = (7100, 13300).
We should note that each cell was sequenced independently, which means the data may require normalization across cells. Normalization is the process of transforming a dataset onto a common scale. This transformation does not always produce a Gaussian distribution, but that is acceptable, as explained before.
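For illustration, one common way to put cells on a common scale in scRNA-seq is total-count scaling (counts per million). This is only a sketch of the idea; it is not necessarily the procedure used to produce the filtered-normalized file we load later.

```python
import numpy as np
import pandas as pd

def cpm_normalize(counts: pd.DataFrame) -> pd.DataFrame:
    """Scale each cell (row) so that its counts sum to 1e6 (counts per million)."""
    totals = counts.sum(axis=1)
    return counts.div(totals, axis=0) * 1e6

# Hypothetical 2-cell, 2-gene count matrix
toy = pd.DataFrame([[10, 90], [5, 5]], columns=["geneA", "geneB"])
norm = cpm_normalize(toy)
print(norm)  # each row now sums to 1e6
```

After such scaling, differences in sequencing depth between cells no longer dominate the comparison of expression profiles.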
Let's plot the gene expression distributions of some selected cells from our dataset.
data_small = data.T.iloc[:, :20]  # select only a subset of cells to keep the run time short
sns.displot(data=data_small,palette="Set3",kind="kde", bw_adjust=2)
<seaborn.axisgrid.FacetGrid at 0x134e2f190>
We can see that the distribution of each cell shows two peaks: this is expected since they represent genes of low and high abundance.
If we visualize the distribution of a single cell, we can clearly see this behaviour.
data_small_cell = data.loc['output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam']
sns.displot(data_small_cell, kind="kde", bw_adjust=2)
plt.show()
In order to compare the plots of our dataset (that we have filtered in the previous steps), we open a filtered-normalized dataset of the same experiment.
norm_df = pd.read_csv("/Users/ela/Documents/AI_LAB/SmartSeq/MCF7_SmartS_Filtered_Normalised_3000_Data_train.txt",delimiter="\ ",engine='python',index_col=0)
norm_df = norm_df.T
print("Dataframe dimensions:", np.shape(norm_df))
Dataframe dimensions: (250, 3000)
Since we took a log transformation on our dataset, let's do the same with the normalized one to have the plots on a similar scale.
norm_df = np.log2(norm_df+1)
norm_df_small = norm_df.T.iloc[:, :20]  # select only a subset of cells to keep the run time short
sns.displot(data=norm_df_small,palette="Set3",kind="kde", bw_adjust=2)
plt.show()
Again, let's visualize the distribution of a single cell from this dataset:
norm_small_cell1 = norm_df.loc['"output.STAR.2_B3_Norm_S57_Aligned.sortedByCoord.out.bam"']
sns.displot(norm_small_cell1, kind="kde", bw_adjust=2)
plt.show()
The plots of the normalized data already look quite similar to those of our dataset; let's apply a normalization technique and see how they change. We choose StandardScaler: although it is sensitive to outliers, we have already filtered out the most anomalous cells, and the standard z-score approach is easily interpretable by a biologist.
Using StandardScaler, we subtract the mean and divide by the standard deviation for every row (i.e. every cell).
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Initialize the StandardScaler object
scaler = StandardScaler()
# Fit the scaler to the data and transform it
data_standardized = scaler.fit_transform(data.T)
data_standardized = pd.DataFrame(data_standardized.T, columns=data.columns, index=data.index)
data_standardized.head()
| WASH7P | MIR6859-1 | WASH9P | OR4F29 | MTND1P23 | MTND2P28 | MTCO1P12 | MTCO2P12 | MTATP8P1 | MTATP6P1 | ... | MT-TH | MT-TS2 | MT-TL2 | MT-ND5 | MT-ND6 | MT-TE | MT-CYB | MT-TT | MT-TP | MAFIP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam | -0.689509 | -0.689509 | -0.325147 | -0.689509 | -0.689509 | -0.112008 | -0.112008 | -0.689509 | -0.689509 | 1.098378 | ... | -0.689509 | -0.689509 | -0.689509 | 2.583558 | 1.937346 | 0.156514 | 2.255324 | -0.689509 | 0.252354 | 0.465493 |
| output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam | -0.566285 | -0.566285 | -0.566285 | -0.566285 | -0.566285 | -0.239710 | -0.239710 | -0.239710 | -0.566285 | 0.642186 | ... | -0.566285 | -0.566285 | -0.566285 | -0.239710 | -0.566285 | -0.566285 | 1.480290 | -0.566285 | -0.566285 | -0.566285 |
| output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam | -0.725110 | -0.725110 | -0.725110 | -0.725110 | -0.725110 | -0.725110 | -0.725110 | -0.725110 | -0.725110 | 2.103780 | ... | -0.725110 | -0.725110 | -0.725110 | 2.931022 | 1.590416 | 0.201101 | 3.678000 | -0.725110 | -0.725110 | -0.725110 |
| output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam | -0.741357 | -0.741357 | -0.400914 | -0.741357 | -0.741357 | -0.400914 | -0.060471 | -0.741357 | -0.741357 | 1.870667 | ... | -0.400914 | -0.741357 | -0.741357 | 2.596941 | 1.584853 | -0.060471 | 3.142530 | -0.400914 | -0.400914 | -0.741357 |
| output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam | -0.802721 | -0.802721 | 0.289478 | -0.802721 | -0.802721 | -0.171288 | 0.289478 | -0.487005 | -0.802721 | 2.175946 | ... | -0.487005 | -0.802721 | -0.171288 | 2.456082 | 1.421310 | -0.802721 | 3.360694 | -0.487005 | -0.487005 | -0.802721 |
5 rows × 22905 columns
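As a sanity check on toy numbers (a minimal sketch; the matrix `X` is hypothetical), per-row standardization via StandardScaler on the transposed matrix matches subtracting each row's mean and dividing by its population standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 4.0, 7.0],
              [2.0, 0.0, 4.0]])  # 2 cells x 3 genes

# StandardScaler standardizes columns, so transpose to scale per cell (row)
scaled = StandardScaler().fit_transform(X.T).T

# Manual equivalent: subtract each row's mean, divide by its std (ddof=0)
manual = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

print(np.allclose(scaled, manual))
```

This is exactly the transposition trick used above on the full expression matrix.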
data_stand_df_small = data_standardized.T.iloc[:, :20]  # select only a subset of cells to keep the run time short
sns.displot(data_stand_df_small,palette="Set3",kind="kde", bw_adjust=2)
plt.show()
data_stand_cell= data_standardized.loc['output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam']
sns.displot(data_stand_cell, kind="kde", bw_adjust=2)
plt.show()
Let's compute the values of skewness and kurtosis of the standardized dataset:
print("Skewness: ", skew(data_standardized))
print("Mean skewness:", np.mean(skew(data_standardized)))
print()
print("Kurtosis: ", kurtosis(data_standardized))
print("Mean kurtosis:", np.mean(kurtosis(data_standardized)))
Skewness: [ 3.17808611 3.01766603 0.5686532 ... 0.56069821 -0.05658675 1.06748017] Mean skewness: 1.320638567645328
Kurtosis: [15.5641668 19.7366884 -0.52045079 ... -0.57930253 -0.75553775 0.15421078] Mean kurtosis: 10.02160514345388
The resulting values are quite low: the distribution is still non-normal, but the skewness is greatly reduced compared both to the original dataset and to the values obtained with the log transformation alone. This is a good result, since high skewness values can cause problems, as already pointed out.
In conclusion, standardization seems to be a good way to scale our parameters, so we decide to apply it.
data = data_standardized
HCC1806 experiment: the same conclusion applies, since we find a mean skewness of 1.2134374660939942 and a mean kurtosis of 10.25664719668806.
data.shape
(316, 22905)
Another important part of our analysis is the selection of genes that are connected to the Hypoxia and Normoxia conditions. We can try to select them using the concepts of entropy and information gain: the most important genes are those that give the highest values of information gain.
Information gain is a measure used to quantify the usefulness of a feature (in this case, a gene) in predicting the target variable ('Hypoxia' or 'Normoxia').
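Concretely, for a binarized feature, IG = H(y) − H(y | feature). The hand-rolled sketch below (toy labels and a hypothetical on/off gene, not our data) illustrates the quantity that mutual_info_classif estimates for continuous expression values:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_high, labels):
    """IG = H(labels) - weighted entropy of labels within each split."""
    ig = entropy(labels)
    for side in (feature_high, ~feature_high):
        if side.any():
            ig -= side.mean() * entropy(labels[side])
    return ig

# Toy example: a gene that is 'on' mostly in hypoxic cells
labels = np.array(["Hypo"] * 4 + ["Norm"] * 4)
gene_on = np.array([1, 1, 1, 0, 0, 0, 0, 1], dtype=bool)
print(f"IG = {information_gain(gene_on, labels):.3f}")
```

A gene perfectly aligned with the condition would give IG = 1 bit here; an uninformative gene would give IG ≈ 0.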
merge = data.merge(data_meta, left_index=True, right_index=True, how="inner")
data_lab = merge.drop(["Cell Line", "Lane", "Pos", "Hours", "PreprocessingTag", "ProcessingComments", "Cell name"], axis=1)
data_lab["Condition"]
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam Hypo
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam Hypo
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam Norm
output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam Norm
output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam Norm
...
output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam Norm
output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam Norm
output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam Hypo
output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam Hypo
output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam Hypo
Name: Condition, Length: 316, dtype: object
data_genes = data.T
n = len(data.index)
We now calculate the information gain of each gene with respect to the target variable data_lab["Condition"] (the label 'Hypoxia' or 'Normoxia'). We can see that approximately the first 3000 genes of this sorted list (3414 of them) are the most useful for prediction. We thus keep those with information gain higher than 0.215 and visualize this in a plot:
from sklearn.feature_selection import mutual_info_classif
# Calculate the information gain for each gene
information_gain = mutual_info_classif(data, data_lab["Condition"])
# Sort the genes based on their information gain (descending order)
sorted_genes = np.argsort(information_gain)[::-1]
sorted_genes = sorted_genes[:3415]
# Print the selected genes
for gene_index in sorted_genes:
    print(f"{data.columns[gene_index]}: Information Gain = {information_gain[gene_index]}")
NDRG1: Information Gain = 0.652850010929818
BNIP3: Information Gain = 0.6293716987199368
HK2: Information Gain = 0.6229342045507877
P4HA1: Information Gain = 0.6182155449226839
GAPDHP1: Information Gain = 0.6146039495703683
BNIP3L: Information Gain = 0.6112144648963216
MT-CYB: Information Gain = 0.6071714290100367
MT-CO3: Information Gain = 0.6069431008363095
FAM162A: Information Gain = 0.5978657302685877
LDHAP4: Information Gain = 0.596249888004365
... (output truncated: 3415 genes printed in total, in descending order of information gain)
ARFGEF3: Information Gain = 0.3411195798043589 FTH1: Information Gain = 0.3407931761362468 HMBS: Information Gain = 0.340591026039869 DUSP10: Information Gain = 0.3393194040131773 ALOX5AP: Information Gain = 0.3390106069880774 VLDLR: Information Gain = 0.3389803961087652 SINHCAF: Information Gain = 0.3382079122985737 RPL17P50: Information Gain = 0.3381513455473184 RNF19B: Information Gain = 0.33780075663637854 ZFAS1: Information Gain = 0.3374762237584428 FASN: Information Gain = 0.3369478110769608 PGM2L1: Information Gain = 0.33654859510549184 RRAGD: Information Gain = 0.3365130732238193 MYRIP: Information Gain = 0.33627803887644836 GGCT: Information Gain = 0.3360943731512249 KLF3-AS1: Information Gain = 0.3359169414860945 DCXR: Information Gain = 0.3350225570572125 TLE1P1: Information Gain = 0.3347659397844036 CDC42EP1: Information Gain = 0.3347134397284013 RPL34: Information Gain = 0.33450125917677065 PCAT6: Information Gain = 0.33443298651000264 EBP: Information Gain = 0.3341287613252497 DUSP4: Information Gain = 0.3338982090514582 CHD2: Information Gain = 0.33343118788088444 ANGPTL4: Information Gain = 0.33175154774551063 RUNX1: Information Gain = 0.331369692996182 INSIG2: Information Gain = 0.33104854544093665 PHLDA3: Information Gain = 0.33104317400127603 GAPDHP40: Information Gain = 0.33101985240964926 RANBP1: Information Gain = 0.33090936165847284 POLR2L: Information Gain = 0.3308256427945433 RNASE4: Information Gain = 0.33079346977729474 DNPH1: Information Gain = 0.33067321033645514 HPDL: Information Gain = 0.330410292103855 POP5: Information Gain = 0.3296129274701629 ATP5F1D: Information Gain = 0.3291292645142707 THAP8: Information Gain = 0.3284083698130156 WEE1: Information Gain = 0.3283100110995478 CCNI: Information Gain = 0.3282750713628402 SLC29A1: Information Gain = 0.32796683907426116 TRIB3: Information Gain = 0.32683887210551266 KLF7: Information Gain = 0.3268368450397412 FOXO3: Information Gain = 0.326681880222085 PSME2P1: Information Gain = 
0.3262130833146015 GNAS-AS1: Information Gain = 0.32564097711994533 FAM220A: Information Gain = 0.3254310982002362 ZNF12: Information Gain = 0.3253328113026219 NUDT5: Information Gain = 0.32505185344772247 MFSD3: Information Gain = 0.3249535962345529 ANG: Information Gain = 0.3248630408018913 DOK7: Information Gain = 0.3245204183557826 PRMT6: Information Gain = 0.32450453828930503 FBXL6: Information Gain = 0.3238838064446756 ELOVL6: Information Gain = 0.3237582352083239 VDAC1: Information Gain = 0.3236953619518801 STRA6: Information Gain = 0.3234145245361739 ASNSP1: Information Gain = 0.32335962986940836 HNRNPAB: Information Gain = 0.32332925710335103 CAPN2: Information Gain = 0.3224888605805294 SLITRK6: Information Gain = 0.32244457539653526 GRB10: Information Gain = 0.32242624636886474 FEN1: Information Gain = 0.32136379373198953 FBXO42: Information Gain = 0.32079991093889904 SLC25A36: Information Gain = 0.320694633789792 CDC42EP3: Information Gain = 0.32067954277740784 GET1: Information Gain = 0.32013105592496616 PCBP1-AS1: Information Gain = 0.32010926048400723 FOXO1: Information Gain = 0.3199641336740555 HEY1: Information Gain = 0.31994812160074027 FAM13A: Information Gain = 0.31990516138402025 BCL10: Information Gain = 0.3198850248481111 FBXO16: Information Gain = 0.3197602617558919 PDZK1: Information Gain = 0.31970430083513146 PTGER4: Information Gain = 0.31960954182225954 TFRC: Information Gain = 0.3195625112278613 KDM5B: Information Gain = 0.3188760416491607 GINS2: Information Gain = 0.3183553236150536 VPS37D: Information Gain = 0.31799650745238695 ADCY9: Information Gain = 0.31767419372839734 LRATD2: Information Gain = 0.3171161426448659 NDUFC2: Information Gain = 0.3167477729169792 NECAB1: Information Gain = 0.3161118054466716 TKFC: Information Gain = 0.3158320063818574 TRIM16: Information Gain = 0.3157956379198734 CDC45: Information Gain = 0.31530942073784374 LINC02649: Information Gain = 0.3152550893733177 TMEM265: Information Gain = 
0.31503321958122377 EDN2: Information Gain = 0.31454691204912977 DENND11: Information Gain = 0.3143727433403063 SRF: Information Gain = 0.31422955560186216 GPS1: Information Gain = 0.3141412830672585 FAM13A-AS1: Information Gain = 0.31299800655638 PDLIM5: Information Gain = 0.3129442000325542 KLHL2P1: Information Gain = 0.31264732761354486 ATP5MC1: Information Gain = 0.3125339901952562 ZBTB21: Information Gain = 0.312330833013021 CFD: Information Gain = 0.3120811544099096 EMX1: Information Gain = 0.31189765444211415 PLBD1: Information Gain = 0.3118961097736972 PTPRH: Information Gain = 0.31171136273245437 ATP5F1E: Information Gain = 0.3114680363353515 APEH: Information Gain = 0.3111802559311103 TCAF2: Information Gain = 0.3110020449343913 MAP1B: Information Gain = 0.31090027335479387 TMEM64: Information Gain = 0.3106699917456266 NECTIN2: Information Gain = 0.3106558505166437 NDUFS6: Information Gain = 0.3105691445312495 TMEM123: Information Gain = 0.3105040936939487 CERS4: Information Gain = 0.3099918617220536 LDHAP3: Information Gain = 0.3098170964458886 CD55: Information Gain = 0.30978055710542063 EIF4EBP1: Information Gain = 0.3097658298422141 PAGR1: Information Gain = 0.3097245388671901 ADAMTS19-AS1: Information Gain = 0.30965733820028163 SEC31A: Information Gain = 0.3095088742066143 FADS1: Information Gain = 0.3094381943409463 GPNMB: Information Gain = 0.30917942831103984 MSANTD3-TMEFF1: Information Gain = 0.30910206582142385 CHMP4C: Information Gain = 0.30900883685448766 TMEM65: Information Gain = 0.3084778644514743 IMMP2L: Information Gain = 0.30846963605807076 RLF: Information Gain = 0.3082882382988481 GAD1: Information Gain = 0.30817765657000007 SDAD1P1: Information Gain = 0.3080022192929237 ANKRD12: Information Gain = 0.307882980052651 SNX27: Information Gain = 0.3075631715286389 RPL21: Information Gain = 0.30736080975371194 ASF1B: Information Gain = 0.30702111713878777 C1QBP: Information Gain = 0.3068227461494053 DHCR7: Information Gain = 
0.30676716592698194 FADS2: Information Gain = 0.30660423965720307 ACLY: Information Gain = 0.306483535366455 CENATAC-DT: Information Gain = 0.3064441246016518 FTH1P16: Information Gain = 0.30600373891133614 H2AX: Information Gain = 0.3057210705939095 VEGFC: Information Gain = 0.3053664819969162 LOXL2: Information Gain = 0.304988153436617 MYO1E: Information Gain = 0.30366960851032876 CCDC28B: Information Gain = 0.3036646783596435 TUFT1: Information Gain = 0.30349431485227374 GAPDHP21: Information Gain = 0.30334089470016967 MOV10: Information Gain = 0.3031108913020306 BCL2: Information Gain = 0.30305833863962883 FLRT3: Information Gain = 0.302677769705896 CBLB: Information Gain = 0.3025725216820696 TRABD2A: Information Gain = 0.30223625603901283 MYO10: Information Gain = 0.302054278174376 MPV17L2: Information Gain = 0.3019493279604124 NDUFB1: Information Gain = 0.30187377585919495 WSB1: Information Gain = 0.3017518948438802 TEDC2: Information Gain = 0.3014144663041751 SDR16C5: Information Gain = 0.3013740670883802 OLFM1: Information Gain = 0.3012045072677103 KLF6: Information Gain = 0.3011681139644382 KPNA2: Information Gain = 0.30109539921754647 CEACAM5: Information Gain = 0.30089604482391996 PHTF1: Information Gain = 0.30080665810492313 ZNF84: Information Gain = 0.3007187828174662 SYT12: Information Gain = 0.3005385073544824 DHRS11: Information Gain = 0.30049312546269746 FDFT1: Information Gain = 0.29992512536029836 MYCBP: Information Gain = 0.2998768601470061 AZIN1: Information Gain = 0.2996308056460979 MYH9: Information Gain = 0.29931831760691674 ACOT7: Information Gain = 0.2992619104811354 DBI: Information Gain = 0.2986955430728764 TTC9: Information Gain = 0.2986518443237698 PPP1R10: Information Gain = 0.2983614160608137 MMP16: Information Gain = 0.29813438959166483 SLC25A10: Information Gain = 0.29807238362168365 SH3GL3: Information Gain = 0.29801330735278353 PSAP: Information Gain = 0.297848973075175 DMRTA1: Information Gain = 0.2976892950213874 ATXN1-AS1: 
Information Gain = 0.29754998990579473 UNC5B-AS1: Information Gain = 0.29747883171939304 LIMCH1: Information Gain = 0.2973255628283493 FANCG: Information Gain = 0.2972506588534811 AGPS: Information Gain = 0.2966435490167285 BCAS1: Information Gain = 0.29613102875794506 DGKD: Information Gain = 0.29604047815969725 ARL8A: Information Gain = 0.29603795929366883 KCNK5: Information Gain = 0.2959695124190611 PCAT1: Information Gain = 0.2954666553942149 MEIKIN: Information Gain = 0.2953160901274361 TPT1-AS1: Information Gain = 0.2952186439593394 CDK2AP1: Information Gain = 0.29472043161181305 ATXN1: Information Gain = 0.2946223913205934 GPR179: Information Gain = 0.2946153106824134 IFFO2: Information Gain = 0.2944336505932923 KLF11: Information Gain = 0.29420594988367155 ACAT2: Information Gain = 0.2940526832618515 PCP4L1: Information Gain = 0.2939103511675727 GPR146: Information Gain = 0.2938787975029069 MB: Information Gain = 0.2934957539438581 BEND5: Information Gain = 0.29334796276005104 BCL2L12: Information Gain = 0.29302253622902086 COPS9: Information Gain = 0.29258449329476344 DOLK: Information Gain = 0.2924016234559621 PCBP1: Information Gain = 0.29227980584110536 ELOVL5: Information Gain = 0.2922482511323048 SHISA5: Information Gain = 0.2918299340926278 PLOD2: Information Gain = 0.29174907058699673 CSNK1A1: Information Gain = 0.29172910169603483 RNF149: Information Gain = 0.2914991774376765 ATAD3A: Information Gain = 0.2910772157484438 ATF4: Information Gain = 0.2908437416511007 RPL31: Information Gain = 0.2907535224799449 PALLD: Information Gain = 0.2906072798392487 PLOD1: Information Gain = 0.2904364027222146 C1orf116: Information Gain = 0.2902859123379993 ADGRF4: Information Gain = 0.29020803697542075 HLA-W: Information Gain = 0.2901208249649754 GYS1: Information Gain = 0.2900655751474077 TMOD3: Information Gain = 0.2900337462110738 KCNG1: Information Gain = 0.29001031531549515 TPX2: Information Gain = 0.2900045871740078 PTEN: Information Gain = 
0.2897363317419843 TAF9B: Information Gain = 0.28963165574572614 BOD1: Information Gain = 0.28943040336402515 EDA2R: Information Gain = 0.28873017518360733 CHRNA5: Information Gain = 0.2885129592685143 HSD17B10: Information Gain = 0.28841393482416455 MALL: Information Gain = 0.28830520230122025 HAUS8: Information Gain = 0.2879304072014175 GADD45A: Information Gain = 0.28788798520622416 B4GAT1: Information Gain = 0.28786909576387854 ARF6: Information Gain = 0.28785399181091953 ZFAND1: Information Gain = 0.28775967143399717 RAB6A: Information Gain = 0.2874887082939297 USP3-AS1: Information Gain = 0.2872057503382648 ELL2: Information Gain = 0.28713750463463317 RET: Information Gain = 0.286085243371917 ATF2: Information Gain = 0.28584917628151674 WDR45BP1: Information Gain = 0.28578903381019516 SIKE1: Information Gain = 0.28575175737510894 KRTAP5-2: Information Gain = 0.2855551688135123 PLIN5: Information Gain = 0.28545308035331174 GAS5: Information Gain = 0.2853596268712164 LRIG3: Information Gain = 0.28530859631222105 NRP1: Information Gain = 0.28529596134736823 GFRA1: Information Gain = 0.2850629302268133 CHAC2: Information Gain = 0.28486026578520596 ATXN3: Information Gain = 0.28455237222775964 TMEM104: Information Gain = 0.2845144601336218 ANKZF1: Information Gain = 0.28439414368713023 ULBP1: Information Gain = 0.2842713581229337 MICB: Information Gain = 0.2840357676990397 IFI35: Information Gain = 0.28382159796524187 HLA-E: Information Gain = 0.28370758459277146 PIK3R3: Information Gain = 0.2836388308339399 NFIL3: Information Gain = 0.283594120218551 PHF19: Information Gain = 0.2834300215909078 CLVS1: Information Gain = 0.28338661697840806 ATP1B1: Information Gain = 0.2831633454056546 CDC25A: Information Gain = 0.2830094930335638 IDI2-AS1: Information Gain = 0.28291140225447675 NDUFC2-KCTD14: Information Gain = 0.2828311697468151 KLHL24: Information Gain = 0.28233429824537315 FBXO32: Information Gain = 0.28231561566636043 TMEM229B: Information Gain = 
0.282254847485333 TSPAN4: Information Gain = 0.28217420948011696 FCGRT: Information Gain = 0.28169831411937607 RAP1GAP: Information Gain = 0.2816418145251669 FAM167A: Information Gain = 0.2814093450832613 ENDOG: Information Gain = 0.28133759976424866 TMEM59: Information Gain = 0.2813076689087106 MVK: Information Gain = 0.2812700775642154 GAPDHP71: Information Gain = 0.2810268560807929 POLR3K: Information Gain = 0.2808673136747568 S100A13: Information Gain = 0.2808620932887578 FBXO38: Information Gain = 0.28079868060468494 LDLRAD1: Information Gain = 0.28026312962833044 MT-CO1: Information Gain = 0.2801851646398599 LAMC2: Information Gain = 0.28007836777009776 PPFIA4: Information Gain = 0.279971944903495 ANXA1: Information Gain = 0.27994197387265585 GDF15: Information Gain = 0.27993567211998105 IL3RA: Information Gain = 0.2797414069482893 GPAT3: Information Gain = 0.2797391504208724 SPC24: Information Gain = 0.2796221490503945 UBE2QL1: Information Gain = 0.27960256320645716 MIR6728: Information Gain = 0.27951575119843386 MALAT1: Information Gain = 0.27927849493832535 PLAAT2: Information Gain = 0.27922824268297086 ACTG1P10: Information Gain = 0.2790785764110386 MYL12-AS1: Information Gain = 0.27877409878308135 GOLM1: Information Gain = 0.2783071510387851 MIR1199: Information Gain = 0.2782619457947797 EIF4B: Information Gain = 0.2782392269290086 CYB561A3: Information Gain = 0.2778732456279105 PPM1K-DT: Information Gain = 0.27781180492837865 MRPL28: Information Gain = 0.277723926449845 CDCA7: Information Gain = 0.2776667445604355 CCDC74A: Information Gain = 0.2775895512791404 SLC25A39: Information Gain = 0.27758744922216283 C4orf47: Information Gain = 0.2774811660883547 ABHD15: Information Gain = 0.27746572127972846 ADM2: Information Gain = 0.277462548589982 PYGL: Information Gain = 0.2773830632561749 FRY: Information Gain = 0.27712361152247333 FUOM: Information Gain = 0.27697501464226093 FTLP3: Information Gain = 0.2769067871979518 GPER1: Information Gain = 
0.2767797633696436 ZNF689: Information Gain = 0.27673180646210715 GALNT18: Information Gain = 0.2765688189486142 RPS27: Information Gain = 0.2765550083052586 MIR181A1HG: Information Gain = 0.27653945540651814 POLA2: Information Gain = 0.27645649318136 SCEL: Information Gain = 0.27644508610738927 FAM47E-STBD1: Information Gain = 0.2761198426022975 INSYN1-AS1: Information Gain = 0.2760825503175002 SAT1: Information Gain = 0.27601807909488985 FOXP1: Information Gain = 0.2759846960610546 SLC25A35: Information Gain = 0.2759237799206835 HLA-T: Information Gain = 0.2758338164292644 C6orf141: Information Gain = 0.27571457581725656 SERGEF: Information Gain = 0.27510515710005135 TRIM29: Information Gain = 0.27497537034386466 HAUS1: Information Gain = 0.2748742430144069 SPRR1A: Information Gain = 0.2747193031457549 APOBEC3A: Information Gain = 0.27465527841288706 SNTB1: Information Gain = 0.27435341508762967 RNF19A: Information Gain = 0.2742626712785603 YEATS2-AS1: Information Gain = 0.27424764417351577 ATIC: Information Gain = 0.27407261350536216 TMEM54: Information Gain = 0.2739833270241119 CENPM: Information Gain = 0.2737275064007094 P3R3URF-PIK3R3: Information Gain = 0.2736094211579705 GPR155: Information Gain = 0.2733991791469623 RYR2: Information Gain = 0.27333990766439253 SERINC3: Information Gain = 0.27333439570270923 CD9: Information Gain = 0.2733339438288158 CCN4: Information Gain = 0.2732647792037217 MAOB: Information Gain = 0.2730994873583439 RPL7: Information Gain = 0.27309181216141876 TNFRSF19: Information Gain = 0.2730729341908622 LDHAP5: Information Gain = 0.27281341231407175 LRP4: Information Gain = 0.27276394495011314 LPP: Information Gain = 0.2726877939672576 LNPK: Information Gain = 0.2725022984947152 NDUFA4L2: Information Gain = 0.2724591727373227 CAST: Information Gain = 0.2722637510089314 CISD3: Information Gain = 0.27222659330117605 CCSAP: Information Gain = 0.27207630879822164 NAPRT: Information Gain = 0.27192074583119896 METTL7A: Information Gain = 
0.27186482517859534 CPEB2: Information Gain = 0.27149418474970255 WDR4: Information Gain = 0.27130327620672334 FTH1P20: Information Gain = 0.27122268636989655 TBC1D8B: Information Gain = 0.2710178806369812 SCARB1: Information Gain = 0.2710060283791378 FAM210A: Information Gain = 0.27100056494115865 PLD1: Information Gain = 0.27098811331982486 CDK5R2: Information Gain = 0.2709189323798129 MTHFD1: Information Gain = 0.2708245218703944 XPOT: Information Gain = 0.27082378271435736 PPP1R3C: Information Gain = 0.27070757068813345 MCM3: Information Gain = 0.2706744684031319 RPL23AP7: Information Gain = 0.2706666920899228 PPP1R14C: Information Gain = 0.270628793362899 TPD52L1: Information Gain = 0.2706237500847388 UNC5B: Information Gain = 0.2705994769100353 FUT3: Information Gain = 0.27058340696118544 JPH2: Information Gain = 0.2705722712696348 SAMD4A: Information Gain = 0.2704497995131947 IGFLR1: Information Gain = 0.27028743243626696 MUC16: Information Gain = 0.27019023099775286 HLA-L: Information Gain = 0.27011851151874344 MRNIP: Information Gain = 0.2699678326861574 ZNF365: Information Gain = 0.26989961038404453 RCN1P2: Information Gain = 0.26986592799502596 RAPGEFL1: Information Gain = 0.26970465917339914 ADAT1: Information Gain = 0.269668358453655 HINT3: Information Gain = 0.26962279661774535 SLC7A11: Information Gain = 0.26954837406896903 RIBC2: Information Gain = 0.2695041333367101 SAMHD1: Information Gain = 0.2694686015826606 GAL: Information Gain = 0.26889526106001127 CXADR: Information Gain = 0.26883249667961273 HSD17B1-AS1: Information Gain = 0.26865426447537266 SMAP1: Information Gain = 0.2685879042865864 ELOVL2-AS1: Information Gain = 0.2685606160497007 LOX: Information Gain = 0.2685120639738079 SHMT1: Information Gain = 0.26847669675550456 KRT83: Information Gain = 0.2684719289962618 NUP62CL: Information Gain = 0.2683971804611138 SPATS2L: Information Gain = 0.2683304311391774 RECQL4: Information Gain = 0.2682228336783341 TKT: Information Gain = 
0.2682165828581684 PWWP3B: Information Gain = 0.2680711004412595 INSYN1: Information Gain = 0.26801295082488785 A4GALT: Information Gain = 0.2679370460447652 STING1: Information Gain = 0.26791495656248876 KRTAP5-AS1: Information Gain = 0.2679015189557685 SRPX: Information Gain = 0.26788444019989144 TBC1D3L: Information Gain = 0.26785704007027844 AGMAT: Information Gain = 0.2674493404704301 FRK: Information Gain = 0.26718387439791424 LATS1: Information Gain = 0.2671765825654524 KRT224P: Information Gain = 0.2670133773593706 GRM4: Information Gain = 0.2669191879132642 HOXA10: Information Gain = 0.26677228575890566 PDGFB: Information Gain = 0.2665151606225167 EIF2B3: Information Gain = 0.266485374868352 PACSIN2: Information Gain = 0.26638352435583723 PPM1J: Information Gain = 0.26636598066985906 ST8SIA6-AS1: Information Gain = 0.2661689850786495 RNPEP: Information Gain = 0.266132046843891 CBX5: Information Gain = 0.26611636925031856 PNMA2: Information Gain = 0.26587747696327435 ANXA2R: Information Gain = 0.26586361569191364 PAK6: Information Gain = 0.2657868438263389 GAPDHP73: Information Gain = 0.2657797301792528 EGFR: Information Gain = 0.26571650912922107 FAM111B: Information Gain = 0.26563558960574274 CDKN2AIPNL: Information Gain = 0.26559940324945663 SOGA3: Information Gain = 0.2655402568457026 MCM10: Information Gain = 0.26547568922346954 CD109: Information Gain = 0.2654202016700278 CDC20: Information Gain = 0.26516439256061664 AHR: Information Gain = 0.26511512636575296 HOXA13: Information Gain = 0.26473274528467106 KMT5B: Information Gain = 0.26458030693325285 GAPDHP64: Information Gain = 0.26456750040151666 C15orf65: Information Gain = 0.2645401070197868 FAM214B: Information Gain = 0.2643437475577086 SLC25A15: Information Gain = 0.26428543137815175 S100P: Information Gain = 0.2642798546794567 GAPDHP69: Information Gain = 0.26399190888642043 RIPPLY3: Information Gain = 0.2639675797430987 RAB3IL1: Information Gain = 0.2639308193270762 ALDOAP1: Information Gain 
= 0.263837402464538 MCRIP2P1: Information Gain = 0.26374732314647864 SLC26A5: Information Gain = 0.2636596468327064 SQSTM1: Information Gain = 0.2634830552261458 TCP11L2: Information Gain = 0.2634748490729 NDUFB10: Information Gain = 0.26347133331276207 POMGNT1: Information Gain = 0.26342751221199956 WDR76: Information Gain = 0.2633516269245191 CHTF8: Information Gain = 0.26334661348193045 OTULINL: Information Gain = 0.2630369569385196 LRATD1: Information Gain = 0.26301356216611516 WDR61: Information Gain = 0.2629718771551306 TTC36: Information Gain = 0.26294214201273536 DPF1: Information Gain = 0.2629234491848389 CFDP1: Information Gain = 0.26291322873594924 ETNK2: Information Gain = 0.26283961517666143 MIR7844: Information Gain = 0.26283889242301717 PARP1: Information Gain = 0.26276813247299 ADGRF1: Information Gain = 0.2626112477871494 IRF6: Information Gain = 0.2625689304528829 LINC00623: Information Gain = 0.2625334334803884 MTCO3P12: Information Gain = 0.2624894537535942 GAPDHP35: Information Gain = 0.2624746781422316 MFSD13A: Information Gain = 0.26246431321333885 ARMC6: Information Gain = 0.26245625092006586 GET1-SH3BGR: Information Gain = 0.2623876112128516 CD320: Information Gain = 0.26231140958174004 MTHFD2: Information Gain = 0.2622930019403662 VAPA: Information Gain = 0.26224578497575846 MIF: Information Gain = 0.2622259267751499 ZNF367: Information Gain = 0.2622067136252557 ZNF148: Information Gain = 0.26217417169400625 SEMA4B: Information Gain = 0.26198125128610794 NECTIN3-AS1: Information Gain = 0.2619539111813203 PCCA-DT: Information Gain = 0.2619133352160974 KCND3: Information Gain = 0.26172317985122007 CAVIN1: Information Gain = 0.261715295201298 ATP5F1A: Information Gain = 0.26169628794437116 PCLAF: Information Gain = 0.26161069246146873 DAPK2: Information Gain = 0.26153577440333664 SLC1A1: Information Gain = 0.26151576022504397 DCAF10: Information Gain = 0.261363261227344 E2F2: Information Gain = 0.26135226966343406 GAS5-AS1: Information Gain = 
0.2612816665776385 PPP1R14B-AS1: Information Gain = 0.26117530733960614 XPOTP1: Information Gain = 0.2611103539615405 H3C4: Information Gain = 0.26106165167219775 MRPL38: Information Gain = 0.2610600135754435 GOLGA6L10: Information Gain = 0.26092294354913226 NRGN: Information Gain = 0.2608882588856587 DTL: Information Gain = 0.26086206442982296 HSD17B1: Information Gain = 0.2607906410567955 RGCC: Information Gain = 0.260776672183884 AIFM1: Information Gain = 0.2607430726686919 SNHG22: Information Gain = 0.2605938256697029 MRPL41: Information Gain = 0.26058771359074084 NT5DC2: Information Gain = 0.26055371616864154 CYP4F22: Information Gain = 0.26048301521679074 BEST4: Information Gain = 0.2604471166715654 NKAIN1: Information Gain = 0.2603064311287244 POLD1: Information Gain = 0.2602431687305602 TUBA3E: Information Gain = 0.26016613557879986 KLF13: Information Gain = 0.2601652403395218 LINC01214: Information Gain = 0.2601495462207397 GIHCG: Information Gain = 0.260081036307515 STXBP5-AS1: Information Gain = 0.26006180211843066 CDKN3: Information Gain = 0.2599372481496778 TARS1: Information Gain = 0.2598808850300307 APOL4: Information Gain = 0.2598368372636737 H4C5: Information Gain = 0.25968214666720435 ZNF337: Information Gain = 0.2596069261190588 DHCR24: Information Gain = 0.25954409417646174 PPP2R5B: Information Gain = 0.25953154521978505 PARK7: Information Gain = 0.2594820394582502 CLPSL2: Information Gain = 0.25944520313054675 RTN4RL1: Information Gain = 0.25940848301644737 RNF144A: Information Gain = 0.2593789825581869 FAM86C1P: Information Gain = 0.2593615875143245 AKR1C1: Information Gain = 0.25934155932255964 H2AC7: Information Gain = 0.25931391711131346 EDN1: Information Gain = 0.25922439577914 CBX4: Information Gain = 0.2592018393409252 MIF-AS1: Information Gain = 0.2591724213226312 MAP4K2: Information Gain = 0.2589883354112861 COA8: Information Gain = 0.2589623261795291 IFI30: Information Gain = 0.2589016054554596 BRCA1: Information Gain = 
0.2588803441986389 GON7: Information Gain = 0.25876867986184027 RBBP7: Information Gain = 0.25872781739517414 SORL1: Information Gain = 0.25869609049530995 BSCL2: Information Gain = 0.2585453814872547 KRT4: Information Gain = 0.2585029032664219 FGF2: Information Gain = 0.25846692118812165 CDK5: Information Gain = 0.25842636078140746 DMC1: Information Gain = 0.25838924365510496 TUBA4A: Information Gain = 0.2583630279624467 FKBP5: Information Gain = 0.2583341647908879 CCDC107: Information Gain = 0.2582591217234478 H2AC9P: Information Gain = 0.25824760974434735 TMEM74B: Information Gain = 0.2581296005405471 NPC1L1: Information Gain = 0.2580660338711631 NDUFA4: Information Gain = 0.2579507081468715 DRAXIN: Information Gain = 0.25792846995014385 TMEM19: Information Gain = 0.2578648538242809 BMF: Information Gain = 0.2578379382620135 PLEKHG1: Information Gain = 0.2578312249077477 RNF180: Information Gain = 0.2577980879906805 HYMAI: Information Gain = 0.2575080693545433 IFI44: Information Gain = 0.2574102060566896 ARID5A: Information Gain = 0.2573344610623354 PLK1: Information Gain = 0.25732496481268874 CEACAM6: Information Gain = 0.25732000000867594 DNASE1L2: Information Gain = 0.25730241387634134 EEF1A1: Information Gain = 0.25726466271272974 TPSP2: Information Gain = 0.25715090030266197 STBD1: Information Gain = 0.25711061264468515 ZNF528-AS1: Information Gain = 0.25707207791649944 CYRIA: Information Gain = 0.25689805828412915 ENO1P1: Information Gain = 0.2568551819194749 ITGB3BP: Information Gain = 0.25682054236428886 HDHD5-AS1: Information Gain = 0.25672813892366886 TNFRSF18: Information Gain = 0.2566193394329608 SPATA18: Information Gain = 0.25653833152148975 TLCD1: Information Gain = 0.2564550249350461 SNTA1: Information Gain = 0.25642733569281684 MED15: Information Gain = 0.25636897402013537 ZNF682: Information Gain = 0.25606513414719245 AZIN2: Information Gain = 0.2560584761868394 HEATR6: Information Gain = 0.256033918539905 ENOX1: Information Gain = 
0.25595366865609215 RNU1-82P: Information Gain = 0.255897863786942 ADRA2A: Information Gain = 0.25585309228671704 CCDC33: Information Gain = 0.25571639445211614 AMPD3: Information Gain = 0.25566919660306775 TNFRSF6B: Information Gain = 0.25559930291289046 HIGD1AP1: Information Gain = 0.2553839469424519 PLEKHO1: Information Gain = 0.2553101998890821 TLE6: Information Gain = 0.255220096358447 ACTBP15: Information Gain = 0.25520234667051245 MITF: Information Gain = 0.25515253987196607 PKDCC: Information Gain = 0.2549932563848616 ARFRP1: Information Gain = 0.25492483093829654 FTH1P12: Information Gain = 0.2549213619167505 MIR210: Information Gain = 0.2549022530376841 MEF2A: Information Gain = 0.25489820765025173 REEP2: Information Gain = 0.2548832615994545 OTX1: Information Gain = 0.2548085094936017 VXN: Information Gain = 0.25475498285944753 SLK: Information Gain = 0.25471170883422967 PARM1: Information Gain = 0.2545795133591311 TSPAN12: Information Gain = 0.2545592977517994 NIBAN1: Information Gain = 0.25451738955894987 TOX2: Information Gain = 0.25447901381712246 CFAP418-AS1: Information Gain = 0.2544155786229587 MYBL1: Information Gain = 0.25430824556168274 MIR34AHG: Information Gain = 0.254289640234179 SINHCAFP1: Information Gain = 0.25422625328191084 GLUD1P3: Information Gain = 0.25420163536573126 FTH1P15: Information Gain = 0.25399151303071177 ANAPC5: Information Gain = 0.25394233033138325 G6PC3: Information Gain = 0.2538309888632868 CASTOR3: Information Gain = 0.25382795178544804 BTG1-DT: Information Gain = 0.253803602429747 TPM4: Information Gain = 0.25360400360363755 CYFIP2: Information Gain = 0.2535144418196691 DPAGT1: Information Gain = 0.25351128208898555 GATA2: Information Gain = 0.25348625612874787 ASNS: Information Gain = 0.25335502057379466 SEL1L: Information Gain = 0.25315175188112726 RUSC1: Information Gain = 0.2531214197689864 RN7SL674P: Information Gain = 0.25308008285098693 RCN3: Information Gain = 0.25298384077634917 CALM3: Information Gain = 
0.252969408466841 ABHD8: Information Gain = 0.2529425493863364 LPIN3: Information Gain = 0.252869690728168 ZMPSTE24-DT: Information Gain = 0.2528365753723889 DNAAF10: Information Gain = 0.25279370495677855 SNW1: Information Gain = 0.25279106755565883 S100A4: Information Gain = 0.25272116126319766 LSS: Information Gain = 0.25271784685136867 DSC2: Information Gain = 0.2526880117367696 EGFR-AS1: Information Gain = 0.252542042113169 DUSP2: Information Gain = 0.2524768358181293 MLKL: Information Gain = 0.2524625508062326 C21orf58: Information Gain = 0.25242126871876747 CRYBG3: Information Gain = 0.25226409025065344 POLE2: Information Gain = 0.2520793748143195 STX3: Information Gain = 0.2520580764428526 LERFS: Information Gain = 0.25198471366165176 EXOG: Information Gain = 0.2519532656282759 TOP2A: Information Gain = 0.25185928565477056 PLBD1-AS1: Information Gain = 0.25185860525908454 NAV1: Information Gain = 0.251856415211378 ATP6V1G1: Information Gain = 0.25182947324980565 TK1: Information Gain = 0.2518265705579086 CFAP251: Information Gain = 0.2517743816455025 TPTE2: Information Gain = 0.2516430162997796 CAVIN2: Information Gain = 0.2515064715935016 KRT19: Information Gain = 0.2514873694041604 CLEC3A: Information Gain = 0.25139535782468236 RELN: Information Gain = 0.2513713502378456 EGR3: Information Gain = 0.2513295198801686 HMGN3: Information Gain = 0.25129694953908377 HES2: Information Gain = 0.25120546093347884 DUSP8: Information Gain = 0.2511739822704 KIF5B: Information Gain = 0.25105445078291755 MCM6: Information Gain = 0.25094091578110245 HOXA10-AS: Information Gain = 0.2509402499128883 EFEMP2: Information Gain = 0.25091698280952546 CALR4P: Information Gain = 0.25086118345482156 DNER: Information Gain = 0.2508488983641215 BMF-AS1: Information Gain = 0.25082371651556823 GAPDHP68: Information Gain = 0.2507564096698707 SERPINE2: Information Gain = 0.2507188445182167 FBP1: Information Gain = 0.25068640030406875 BMS1P10: Information Gain = 0.25063886848766503 
KRT18P46: Information Gain = 0.25056640668990493 MMP13: Information Gain = 0.25055276210213284 GAPDHP32: Information Gain = 0.2505192997886956 ADAMTS9-AS2: Information Gain = 0.2503647258807713 KBTBD2: Information Gain = 0.2503367570449522 SERTAD2: Information Gain = 0.2503324564654821 RGS20: Information Gain = 0.25031113823882767 C2CD2: Information Gain = 0.2502969663194057 MIR7113: Information Gain = 0.25026484094388346 PPP1R3E: Information Gain = 0.25019572808285795 ARID3A: Information Gain = 0.25005688482750843 ERICH6-AS1: Information Gain = 0.24992737452815517 STAG3: Information Gain = 0.24986176050500886 RAMP2: Information Gain = 0.24979395299735563 LRP4-AS1: Information Gain = 0.24978877156021206 GPR139: Information Gain = 0.24978287519963582 SYNE3: Information Gain = 0.2497686320343837 CPA6: Information Gain = 0.2496866903015571 GLRA3: Information Gain = 0.24951886232946818 ERLNC1: Information Gain = 0.2495002502945829 EEF1A1P13: Information Gain = 0.24935505458842488 WSCD1: Information Gain = 0.24933922253041718 PTTG1IP: Information Gain = 0.2491458001336242 SDK1-AS1: Information Gain = 0.249044509543362 FLOT2: Information Gain = 0.24892132963445324 MFSD11: Information Gain = 0.24889488091891554 TOX3: Information Gain = 0.24882300461383955 PLXNA2: Information Gain = 0.24877200147015777 TNNT1: Information Gain = 0.24869560962514337 PHLDB2: Information Gain = 0.24866869026688798 LIN7A: Information Gain = 0.248622745440084 IDS: Information Gain = 0.248599739920095 ANXA3: Information Gain = 0.24856346230153847 SCGB2A1: Information Gain = 0.24854435500586436 DHX40: Information Gain = 0.24847001656476397 GLIDR: Information Gain = 0.2484643202850607 IL17RB: Information Gain = 0.2483320438636627 KRT16: Information Gain = 0.2483029630227287 ANK2: Information Gain = 0.24827561277898758 CHAF1B: Information Gain = 0.24825734852735426 ZMAT4: Information Gain = 0.24822845844753538 CYB5B: Information Gain = 0.24815341814701353 SRD5A3-AS1: Information Gain = 
0.24814017995767546 SLC47A1: Information Gain = 0.24808639786792197 SPA17: Information Gain = 0.2480627202086385 LRP2: Information Gain = 0.2480354338882762 ACTG1P12: Information Gain = 0.24792471921538106 SMIM15: Information Gain = 0.24792055052278839 NAXE: Information Gain = 0.24789529673023214 ZNF524: Information Gain = 0.24786576265489635 THEG: Information Gain = 0.24786164775243602 RANGRF: Information Gain = 0.2478589861653362 FNDC10: Information Gain = 0.24784370604918116 ISOC1: Information Gain = 0.24780862974264872 TRIM16L: Information Gain = 0.24779957732344893 GPRC5A: Information Gain = 0.24773820089944776 MID1: Information Gain = 0.24769986799681454 ERRFI1: Information Gain = 0.24767831237714555 CCDC71: Information Gain = 0.24762081388256418 MLEC: Information Gain = 0.2476188497069094 TONSL: Information Gain = 0.24758223037283633 CCR3: Information Gain = 0.24757386624598676 COL9A2: Information Gain = 0.24753187407415655 C1QTNF6: Information Gain = 0.2474992541739427 COL17A1: Information Gain = 0.2474431866391309 TM7SF2: Information Gain = 0.24731925279566358 SYNGR3: Information Gain = 0.24731825303892374 KHDC1: Information Gain = 0.24729100391234016 RGS17: Information Gain = 0.24727714177218596 C1R: Information Gain = 0.2471836493274846 ACSS1: Information Gain = 0.24715593668601432 TENM3-AS1: Information Gain = 0.24715310820720826 SERINC1: Information Gain = 0.24712296028929415 LINC01659: Information Gain = 0.2470604359243609 FOXRED1: Information Gain = 0.2470452640735621 MUC12-AS1: Information Gain = 0.24702283750942833 FTH1P7: Information Gain = 0.24691272274668097 HERC3: Information Gain = 0.24689651516580557 TATDN1P1: Information Gain = 0.24686701732789262 KRT17: Information Gain = 0.24683117474196248 NUAK1: Information Gain = 0.24682772877741788 PGLYRP2: Information Gain = 0.24677170482703503 MCUB: Information Gain = 0.24674147099496557 MYORG: Information Gain = 0.24660831325584565 ACTR3C: Information Gain = 0.24648125393082743 TMCC3: Information 
Gain = 0.24635959829949616 NPY1R: Information Gain = 0.24622830701972265 LRRC45: Information Gain = 0.24617243115157184 BLNK: Information Gain = 0.24616957706747855 NAMPTP1: Information Gain = 0.24615151950972147 MIR3917: Information Gain = 0.24608911970121738 CSTF3: Information Gain = 0.2460340080491099 FOXP2: Information Gain = 0.24601708134383204 FOXI3: Information Gain = 0.2459613487242629 GAPDHP44: Information Gain = 0.24590084654076105 YPEL5: Information Gain = 0.24584465255417287 RN7SL1: Information Gain = 0.24575169372648964 PRKAA2: Information Gain = 0.24567323599873658 SPATA12: Information Gain = 0.24545635496684493 PTPRR: Information Gain = 0.24545507119223275 COQ4: Information Gain = 0.24542226662594802 DPCD: Information Gain = 0.24536687629178555 CCND3: Information Gain = 0.24523069900878802 ARHGEF28: Information Gain = 0.24512182233871127 MKRN4P: Information Gain = 0.24506969164666748 TMEM45B: Information Gain = 0.24504738461367914 ATP6AP1L: Information Gain = 0.2449908102859597 MIR6819: Information Gain = 0.24496471188852875 FTH1P8: Information Gain = 0.24494051297494113 SBK1: Information Gain = 0.2449287834449423 SUOX: Information Gain = 0.24491499544640982 MEAF6: Information Gain = 0.24487018338882405 MAGEF1: Information Gain = 0.2448694827098863 ATP5MG: Information Gain = 0.244854807562914 RBP7: Information Gain = 0.24474188145401854 MAB21L3: Information Gain = 0.2446786504243803 GALR2: Information Gain = 0.2446604468068374 WASF4P: Information Gain = 0.24462178582580263 ARL6IP1P2: Information Gain = 0.2446019633679959 SARS1: Information Gain = 0.2445966725363582 MIR6811: Information Gain = 0.24457768821298687 ZNF766: Information Gain = 0.24452335544904957 DOCK11: Information Gain = 0.24448815848220473 CHST14: Information Gain = 0.24442249999217736 NUDT6: Information Gain = 0.2444129995798876 ECI1: Information Gain = 0.24431868934838485 SOWAHC: Information Gain = 0.2443178084602311 TOMM40P2: Information Gain = 0.24424363377821945 SEPHS1P4: 
Information Gain = 0.24414518497759063 RPS12P26: Information Gain = 0.2441150880535503 HSPB1P2: Information Gain = 0.2440886171984169 LONRF2: Information Gain = 0.24406980012797352 THEMIS2: Information Gain = 0.24406636917219582 CNPY4: Information Gain = 0.24398523653229964 DTYMK: Information Gain = 0.243983459122306 ABCB8: Information Gain = 0.2439529536335583 TMEM132B: Information Gain = 0.24388615213609666 HS6ST3: Information Gain = 0.2438318665213861 SOD2-OT1: Information Gain = 0.24382830060664773 ID2-AS1: Information Gain = 0.24378743599418717 ETV6: Information Gain = 0.2437093602035718 CCDC74B: Information Gain = 0.24366562690877447 DPT: Information Gain = 0.24365950761627309 CSGALNACT1: Information Gain = 0.24365153582849053 KCNN1: Information Gain = 0.24355700311347728 ZNF70: Information Gain = 0.2435468639997509 TIGD3: Information Gain = 0.24351609671637076 RHPN1-AS1: Information Gain = 0.2434849233533931 MALRD1: Information Gain = 0.24347610421096677 KRT89P: Information Gain = 0.2434514779810415 DACT3-AS1: Information Gain = 0.2434238135052158 PPP1R3B: Information Gain = 0.24342353421713603 CHAC1: Information Gain = 0.24338906798077486 ATG14: Information Gain = 0.2433754042396843 SEPSECS-AS1: Information Gain = 0.24335422852013555 ARHGEF35-AS1: Information Gain = 0.24328824624722278 IL17D: Information Gain = 0.24319964545901795 STMN4: Information Gain = 0.2431991000098015 DEPDC4: Information Gain = 0.2431570299167174 GINS1: Information Gain = 0.2431060153208553 MRTFA: Information Gain = 0.24294793064351428 MUC5B-AS1: Information Gain = 0.24293735805468253 LRG1: Information Gain = 0.24293615068394625 AXL: Information Gain = 0.2429185410625616 MCOLN3: Information Gain = 0.24289864768455582 OR2A9P: Information Gain = 0.24271949200042364 TNFRSF10B: Information Gain = 0.2427188107048095 MELTF: Information Gain = 0.24269685592380075 PTH1R: Information Gain = 0.24263054399885986 ZNF264: Information Gain = 0.24258303803765902 RTL8B: Information Gain = 
0.242565986585922 MIR6830: Information Gain = 0.24253533227578883 DTNA: Information Gain = 0.24249583638201577 PKD1P6: Information Gain = 0.24249092986577114 OPLAH: Information Gain = 0.2424595925470232 FGD2: Information Gain = 0.24241172840572633 SUMO3: Information Gain = 0.24237865328380437 IGHE: Information Gain = 0.24237094691127625 ANXA2: Information Gain = 0.24236232697644589 CDYL: Information Gain = 0.24232496351526622 LINC01615: Information Gain = 0.2423067613893226 MRPL12: Information Gain = 0.24229740688538448 ASPM: Information Gain = 0.24227833832343482 CDC6: Information Gain = 0.24225926665669584 GTSE1: Information Gain = 0.24223185585656193 IFNAR2: Information Gain = 0.24222975143953174 FAS: Information Gain = 0.24222944690931514 UMODL1: Information Gain = 0.2421623236136552 SH3RF2: Information Gain = 0.2421504501205698 DIPK2A: Information Gain = 0.24206574024666505 E2F1: Information Gain = 0.24205173055049678 CORO1C: Information Gain = 0.24203233926590917 CDC42EP2: Information Gain = 0.24202508753041063 RUNX2: Information Gain = 0.24201327753927648 CCL22: Information Gain = 0.24198715109136626 MDK: Information Gain = 0.2419195907355769 MIR4743: Information Gain = 0.24190335861019197 GRPEL2: Information Gain = 0.24188785003571378 PALM2AKAP2: Information Gain = 0.24187499308519467 RAB37: Information Gain = 0.24186648415019896 SVIL: Information Gain = 0.2418331584876361 MAP7D2: Information Gain = 0.24170269658723686 PPP2CA-DT: Information Gain = 0.24164272755544647 NAGS: Information Gain = 0.24156920417340055 EMID1: Information Gain = 0.24147075980024657 C1QTNF7-AS1: Information Gain = 0.24143564829260744 GREB1: Information Gain = 0.24139407259531875 RNF41: Information Gain = 0.24137902632608998 NUDT1: Information Gain = 0.24136036996836752 SOX11: Information Gain = 0.24133374573439026 IFRD1: Information Gain = 0.2413154143743652 PPP1CB: Information Gain = 0.2412922568833804 CDH11: Information Gain = 0.24123048707862194 MIR761: Information Gain = 
0.241229988893519 ZBTB20-AS1: Information Gain = 0.24121089680672636 ZDHHC9: Information Gain = 0.24120986094820585 PDGFC: Information Gain = 0.24114249165642487 ADPRH: Information Gain = 0.24113212530093908 CPLANE2: Information Gain = 0.24109235118591266 RNU6-8: Information Gain = 0.24103300019442853 CYBA: Information Gain = 0.24102782298878678 TMCO3: Information Gain = 0.24092495913103162 RFX3-AS1: Information Gain = 0.2408675259990367 S1PR5: Information Gain = 0.2408065355440523 PKD2: Information Gain = 0.24074506622407688 FTH1P11: Information Gain = 0.24072065789581298 GOLGA2P5: Information Gain = 0.24070495979656403 ZNF610: Information Gain = 0.2406276083824883 MIR3198-2: Information Gain = 0.2405905291450514 DSCAM: Information Gain = 0.24055367914927372 SMARCE1P5: Information Gain = 0.24052738229100568 LIF: Information Gain = 0.240447126997833 CAVIN2-AS1: Information Gain = 0.24043490168646597 LINC00526: Information Gain = 0.2404319632938534 CHML: Information Gain = 0.24037383537933343 SPTBN4: Information Gain = 0.2403107525131638 LINC00598: Information Gain = 0.24022177729758654 LNC-LBCS: Information Gain = 0.24013693661200297 C12orf60: Information Gain = 0.2400221372919782 CLGN: Information Gain = 0.23998849951225676 ARL2BPP4: Information Gain = 0.23996953882175265 KCTD11: Information Gain = 0.23993501502118542 CXCR4: Information Gain = 0.23992450525135367 ASPH: Information Gain = 0.23989643415348194 KIF4A: Information Gain = 0.23987853784208468 SKA3: Information Gain = 0.23981006240540315 HS3ST1: Information Gain = 0.23979672347933323 C19orf38: Information Gain = 0.23978956421965947 GRIN2C: Information Gain = 0.2397878943684082 CDKL2: Information Gain = 0.2397734154710771 SPRR1B: Information Gain = 0.23970548838346883 CENPX: Information Gain = 0.2396402433608844 DRAIC: Information Gain = 0.23961612969445145 NCMAP-DT: Information Gain = 0.23959014121124222 PAOX: Information Gain = 0.23952058375573304 YBX2: Information Gain = 0.23947493108653517 SEPTIN11: 
Information Gain = 0.23947387449057045 FCHO2-DT: Information Gain = 0.23935894083917875 LNX2: Information Gain = 0.2393201252301993 ZRANB1: Information Gain = 0.23928992717933162 NEK9: Information Gain = 0.23925995636978103 CEP19: Information Gain = 0.23916674893010237 LPAR3: Information Gain = 0.23911807907166227 NR3C1: Information Gain = 0.2390874908127547 WEE2: Information Gain = 0.23907931898054247 STMN1: Information Gain = 0.23905657335783292 OTOS: Information Gain = 0.23903516669235736 MIF4GD: Information Gain = 0.23898345947423727 NPEPPSP1: Information Gain = 0.23898275737812424 FAM177B: Information Gain = 0.23897891830996754 SIPA1L2: Information Gain = 0.23896821746677133 TMEM105: Information Gain = 0.23895927415153184 LINC02889: Information Gain = 0.23894434107541174 ANKRD22: Information Gain = 0.23894130574983463 PXDC1: Information Gain = 0.23892327058294427 GAMT: Information Gain = 0.23891690179067115 ISM2: Information Gain = 0.23884457954500116 TMPRSS9: Information Gain = 0.23881978368042756 FTH1P2: Information Gain = 0.23880428629135309 ARHGEF34P: Information Gain = 0.23872530215740007 GDAP1: Information Gain = 0.23866152616604008 NF2: Information Gain = 0.23857938739071916 SPRED1: Information Gain = 0.23848852074389537 BTC: Information Gain = 0.23846497956561263 TRIM60P18: Information Gain = 0.23840118963960855 MEX3D: Information Gain = 0.23834088404862852 IFI16: Information Gain = 0.23830656094966685 GDPD3: Information Gain = 0.23829516021764463 NAV2: Information Gain = 0.2382463719009229 MIR636: Information Gain = 0.23820452275806914 HSD17B14: Information Gain = 0.23815660851975884 CLPSL1: Information Gain = 0.23814587932737608 KCNJ8: Information Gain = 0.238132787204677 GSC: Information Gain = 0.23812298905882634 PCAT7: Information Gain = 0.2381218509683869 LINC00636: Information Gain = 0.23809906653897817 PRRC1: Information Gain = 0.23807386989319235 HSH2D: Information Gain = 0.23806249603727725 TIMELESS: Information Gain = 0.23805549167841145 
CREB5: Information Gain = 0.23800618352892355 TRAV18: Information Gain = 0.2379902999503103 PHC2-AS1: Information Gain = 0.23797257375761527 PTGFRN: Information Gain = 0.23796976887233812 PRELID1: Information Gain = 0.2379297598248329 SEMA6C: Information Gain = 0.23777863335181104 PAG1: Information Gain = 0.23777382195360475 OR7E39P: Information Gain = 0.23777246495086168 GLT1D1: Information Gain = 0.23776055683180486 AGBL2: Information Gain = 0.23774843491220343 FAM178B: Information Gain = 0.23774283925390538 ST13P6: Information Gain = 0.23772827417015363 LHX2: Information Gain = 0.2376805312321537 ZNNT1: Information Gain = 0.23760024400805424 HSPB1P1: Information Gain = 0.23759926272158194 CORO1A-AS1: Information Gain = 0.2375602659481697 THRIL: Information Gain = 0.23754919867793478 SNRPGP15: Information Gain = 0.23753792983117394 C2CD4C: Information Gain = 0.23751637677693038 DDX59: Information Gain = 0.23751408944362318 NPY5R: Information Gain = 0.23751311003622644 FYB2: Information Gain = 0.23749218699302888 MAP1A: Information Gain = 0.2374831447535506 COL13A1: Information Gain = 0.23748212122914336 ID4: Information Gain = 0.2374450569784332 IL12A-AS1: Information Gain = 0.23743069670453387 TAGAP-AS1: Information Gain = 0.2373985740200193 LINC00824: Information Gain = 0.23738477943777125 GOLGA5: Information Gain = 0.23737153318966442 GCNT3: Information Gain = 0.23736754743153932 OR7E126P: Information Gain = 0.23736094808905883 FDX2: Information Gain = 0.23734191586779851 KCTD17: Information Gain = 0.23731800638533262 PRICKLE2-DT: Information Gain = 0.23729321730140063 GBX2: Information Gain = 0.23725309092781877 EDARADD: Information Gain = 0.23722195452920602 IL20: Information Gain = 0.23720942100021558 FAM230I: Information Gain = 0.23720322351636125 MIR6785: Information Gain = 0.23719219093372335 RPL7P6: Information Gain = 0.23718146987514555 NUSAP1: Information Gain = 0.2371367562106823 CMKLR2: Information Gain = 0.23712266517254932 LRRC3: Information Gain 
= 0.2371001749770143 MAF: Information Gain = 0.2370461363178531 C14orf132: Information Gain = 0.23702971688368235 TNIK: Information Gain = 0.23700778159979508 DINOL: Information Gain = 0.23700376793126066 DNAH10OS: Information Gain = 0.23700207320299338 ARIH1: Information Gain = 0.23699417466945505 FGF13: Information Gain = 0.23697182300404918 RPL7P47: Information Gain = 0.23692857407734813 SWAP70: Information Gain = 0.23673224350368738 HS6ST2: Information Gain = 0.2367183397206014 LINC01977: Information Gain = 0.2366813629778881 LINC00629: Information Gain = 0.23667889747847792 LINC00866: Information Gain = 0.23667436652482987 MIR6765: Information Gain = 0.23665679150495955 ZNF304: Information Gain = 0.23665578251731723 PEX5: Information Gain = 0.23663812325405775 THRSP: Information Gain = 0.2365808727175831 FTH1P5: Information Gain = 0.23656997860861484 CDKN1A: Information Gain = 0.23650485411983602 STAB1: Information Gain = 0.2364818275796634 PHGDH: Information Gain = 0.23648096360710835 LINC01340: Information Gain = 0.23646319387558923 MCM7: Information Gain = 0.23645408094351605 ALOX5: Information Gain = 0.23642469020459345 ZMYM5: Information Gain = 0.2364217974831211 DCLK2: Information Gain = 0.2364194873469705 ECPAS: Information Gain = 0.23641657123361748 ABHD4: Information Gain = 0.23637866115833006 RPL4P6: Information Gain = 0.2363036868995687 FGFR4: Information Gain = 0.2362851261834915 KLKP1: Information Gain = 0.23628109535672048 SUMO2P17: Information Gain = 0.23625510052834864 ARHGAP22: Information Gain = 0.23623946532772178 P4HA3-AS1: Information Gain = 0.23617976504243576 SCGB1D2: Information Gain = 0.23615145338248777 SPATA6: Information Gain = 0.23614821891595095 SMU1P1: Information Gain = 0.23611168508326363 RSL1D1: Information Gain = 0.2361066542375616 ZNF460: Information Gain = 0.23608134312003792 MIDEAS: Information Gain = 0.2360607398333634 SND1-IT1: Information Gain = 0.23605486123649144 ACKR2: Information Gain = 0.23602877185842663 SUMO2P21: 
Information Gain = 0.23595731353440552 ANKRD34A: Information Gain = 0.23593691851440757 CAD: Information Gain = 0.23591446181258768 ZMAT1: Information Gain = 0.23588003890349096 TDRD12: Information Gain = 0.2358469247037367 TRBV30: Information Gain = 0.23584247285696702 RAC3: Information Gain = 0.2358360565867108 SULT2B1: Information Gain = 0.23582814129391894 C11orf98: Information Gain = 0.23582718221021848 ZNF841: Information Gain = 0.2358014367939365 P3H2: Information Gain = 0.23578790584651532 GJB5: Information Gain = 0.2357723218463399 SNAP91: Information Gain = 0.23571822615296179 HDLBP: Information Gain = 0.23571571294198534 NQO2-AS1: Information Gain = 0.2355932218401282 ANKRD1: Information Gain = 0.23555654903596346 CCDC80: Information Gain = 0.23553204210950018 KY: Information Gain = 0.23549463636774814 SPINK8: Information Gain = 0.2354582553776967 IL6R: Information Gain = 0.23544375224590008 PCDH20: Information Gain = 0.23542105346883746 ACTG1P20: Information Gain = 0.23535656233267455 RBP1: Information Gain = 0.23530991647445965 SPTLC3: Information Gain = 0.23530633626513753 GAPDHP38: Information Gain = 0.2352794753465186 OIP5: Information Gain = 0.23526578168307188 DNAJB6P2: Information Gain = 0.23524576603003844 SERPINB5: Information Gain = 0.23519217480665922 DHRS7: Information Gain = 0.2351873236701587 ESCO2: Information Gain = 0.2351739514749529 MIR4737: Information Gain = 0.23516785134824114 GATA5: Information Gain = 0.235154459441397 NCAPH: Information Gain = 0.23512567882719893 CLSPN: Information Gain = 0.23510814502237776 MIR6833: Information Gain = 0.2351052539645786 PPP2R2A: Information Gain = 0.23507915515739297 MIR4428: Information Gain = 0.2350300007826278 CDH13: Information Gain = 0.2350221825968586 GAPDH-DT: Information Gain = 0.23500945620925529 RNF157: Information Gain = 0.23500545349828772 GJA3: Information Gain = 0.2349905071603775 TMTC1: Information Gain = 0.23498078656628785 ZNF853: Information Gain = 0.23494329950921777 GATA2-AS1: 
Information Gain = 0.23490809800577184 ATAD5: Information Gain = 0.2348975652410843 MIR4793: Information Gain = 0.23489316102954616 ZNF710: Information Gain = 0.2348831226606407 COL4A3: Information Gain = 0.2347762367264199 FTH1P10: Information Gain = 0.23477602905858386 PPFIBP2: Information Gain = 0.23476177016773625 TMPRSS13: Information Gain = 0.23474004393427905 AFAP1-AS1: Information Gain = 0.23473132799331253 NEK2: Information Gain = 0.23468818933868718 ANK1: Information Gain = 0.2346605572180216 SNORD35B: Information Gain = 0.234659677226716 BTG3-AS1: Information Gain = 0.23463223243992215 MIR6730: Information Gain = 0.23461925006747086 BMP6: Information Gain = 0.23457173426443045 ZDHHC11B: Information Gain = 0.2345582194555198 MARK3: Information Gain = 0.23453120878207168 NCOR2: Information Gain = 0.23451631114332994 CALM2P2: Information Gain = 0.2345034363982481 ADAM20P1: Information Gain = 0.2344745496315921 IL18: Information Gain = 0.23444633658426217 SCHLAP1: Information Gain = 0.2343876237319984 CDH16: Information Gain = 0.2343726407828981 ZBTB20: Information Gain = 0.23434826657879393 LINC02343: Information Gain = 0.23433802685512095 ZNF697: Information Gain = 0.2341827352160084 OXER1: Information Gain = 0.23417849468199115 CCDC148-AS1: Information Gain = 0.2341705768194533 EIF2S2P3: Information Gain = 0.23415706648183066 ZNF654: Information Gain = 0.23413596586315832 KLHDC8B: Information Gain = 0.23409236043410808 EN2: Information Gain = 0.23406175589778289 EFNB1: Information Gain = 0.23402197673087644 ALDOC: Information Gain = 0.23399372701590293 HGH1: Information Gain = 0.23392852117873608 SNORD69: Information Gain = 0.23390800063854722 INTS4P1: Information Gain = 0.2338783188774325 NDUFB8P2: Information Gain = 0.23378741775522194 NBEAP5: Information Gain = 0.23374895921749395 MBOAT7: Information Gain = 0.23373229807129725 ACSBG1: Information Gain = 0.2337180421091536 LINC01016: Information Gain = 0.23371112463784782 EIF4H: Information Gain = 
0.23370231910041062 LINC01529: Information Gain = 0.23367982063349602 FGD3: Information Gain = 0.23365611534501785 FAM83G: Information Gain = 0.2335551356484875 RRAS: Information Gain = 0.23355231840434998 STX17-DT: Information Gain = 0.2335406068507735 UBASH3B: Information Gain = 0.23352974466419285 CCDC137: Information Gain = 0.23350652021129936 HLF: Information Gain = 0.23349848866088485 PPP1R9A: Information Gain = 0.23349440330080684 IRF2-DT: Information Gain = 0.2333809392049604 CAPN8: Information Gain = 0.23336648475071864 DLX5: Information Gain = 0.23333269377863286 PTGES: Information Gain = 0.2333086555652979 KCNIP4: Information Gain = 0.23329945689950327 OXR1-AS1: Information Gain = 0.23328325183729448 LHX6: Information Gain = 0.23327352226642972 PIGW: Information Gain = 0.2332446083488866 VN1R48P: Information Gain = 0.23321735376362973 MIR6865: Information Gain = 0.23319675239242144 FEM1B: Information Gain = 0.23319671170696443 EMILIN3: Information Gain = 0.2331913137297872 MIR4640: Information Gain = 0.23315423632717547 IL17C: Information Gain = 0.23309380249306555 MIR6866: Information Gain = 0.23306262443230308 RNF122: Information Gain = 0.23298730807710188 LINC02656: Information Gain = 0.2329709496873833 ZNF295-AS1: Information Gain = 0.232966130161665 SLC25A5: Information Gain = 0.2329449371433967 CCDC175: Information Gain = 0.23293304214092014 C7orf61: Information Gain = 0.23285278822323763 RASGEF1C: Information Gain = 0.2328295211424114 ABCC4: Information Gain = 0.23281774176640968 EMP1: Information Gain = 0.232816567738523 CACNA1C: Information Gain = 0.23279890471247433 FBXL7: Information Gain = 0.2327874470768816 TFF2: Information Gain = 0.23278707820711264 SRD5A3: Information Gain = 0.23275601543143143 KRT87P: Information Gain = 0.23274842365349957 PLEKHB1: Information Gain = 0.23272064049653074 MANCR: Information Gain = 0.23270259318008435 GCHFR: Information Gain = 0.23269121398471637 HBEGF: Information Gain = 0.2326535283469089 DMRT1: 
Information Gain = 0.23253757023935306 TOMM40P1: Information Gain = 0.23247253904359688 GPR132: Information Gain = 0.232458067931969 SNORD56: Information Gain = 0.23245679520410034 CNIH2: Information Gain = 0.23245052638634078 ALDH3A1: Information Gain = 0.23239599183341864 P2RX2: Information Gain = 0.23232648950449297 NKPD1: Information Gain = 0.23226547382185148 HEBP2: Information Gain = 0.232261418779806 S1PR4: Information Gain = 0.2322485154838838 PRAP1: Information Gain = 0.23224724146529607 PCSK5: Information Gain = 0.2322430028003648 EFCAB6-DT: Information Gain = 0.23223037082316322 GPAA1: Information Gain = 0.23222132118302308 MT-TS2: Information Gain = 0.23220516858278328 IRX4: Information Gain = 0.23217588539569478 GUCY2C: Information Gain = 0.23217581710357305 SORCS1: Information Gain = 0.23212525404585627 ZFP69B: Information Gain = 0.23210107983836314 OR7E36P: Information Gain = 0.2320763163044426 SLC4A8: Information Gain = 0.23199545715791192 LARGE2: Information Gain = 0.2319933300211745 RACGAP1: Information Gain = 0.2319765556466471 FAM83E: Information Gain = 0.2319397992466179 LAPTM5: Information Gain = 0.231930572741474 GABARAPL1: Information Gain = 0.23192454510472915 AFF3: Information Gain = 0.23189708853926883 KCNN3: Information Gain = 0.2318955630511017 SMPD5: Information Gain = 0.23169334194735525 OTOAP1: Information Gain = 0.2316768299242058 PPP1R14BP2: Information Gain = 0.23166161097503934 NEIL3: Information Gain = 0.23162061798097455 LINGO3: Information Gain = 0.23161261349599593 SPX: Information Gain = 0.23160753268229683 VCP: Information Gain = 0.2315938924680503 TMEM51-AS1: Information Gain = 0.23157210545311702 SMOC2: Information Gain = 0.2315269279483616 GATD3A: Information Gain = 0.2314951890054211 SFXN5: Information Gain = 0.23149098729416684 MIR6775: Information Gain = 0.23146620682255858 AGPAT4: Information Gain = 0.23146117763087837 ZNF333: Information Gain = 0.2314560201583995 CSRP2: Information Gain = 0.23140629739882113 NUGGC: 
Information Gain = 0.2314019797526865 RPL23AP49: Information Gain = 0.2313942614840634 ACRV1: Information Gain = 0.2313729340737185 ANTKMT: Information Gain = 0.23137206392080745 ATP6V1D: Information Gain = 0.23133762926655255 TCIRG1: Information Gain = 0.23131646941986395 CCDC87: Information Gain = 0.23124092017082543 NPIPB2: Information Gain = 0.2312298621470521 ELAC2: Information Gain = 0.23120505494305132 EIF4A1P5: Information Gain = 0.2312017818026566 KRT23: Information Gain = 0.2311784125083518 RACK1P1: Information Gain = 0.23117404483375936 MSLNL: Information Gain = 0.2311665218427632 HPGD: Information Gain = 0.23111044588020824 ADGRE2: Information Gain = 0.23108769521333272 USH1G: Information Gain = 0.23106856603183967 DLEU2L: Information Gain = 0.23106739102246787 SHLD1: Information Gain = 0.23105536014662276 EIF4BP5: Information Gain = 0.23105319052878603 TRPC6: Information Gain = 0.23100914381351956 SNORD62B: Information Gain = 0.2309993331914897 LINC01176: Information Gain = 0.23099913464757882 KCNJ3: Information Gain = 0.23099506880055198 CSF1: Information Gain = 0.2309821078313219 TSPAN13: Information Gain = 0.23093687043601463 CDKN2C: Information Gain = 0.2309138826057031 MASP1: Information Gain = 0.23087644840274457 MIR4751: Information Gain = 0.23086448889066213 PVRIG: Information Gain = 0.23086386616603582 LINC01164: Information Gain = 0.23085150908555563 FRG1HP: Information Gain = 0.23082259763193336 PLAGL1: Information Gain = 0.23080793212493678 CASC15: Information Gain = 0.23079779391664745 LCN2: Information Gain = 0.23079632403960382 PLA2G2A: Information Gain = 0.23075411006565294 THUMPD1P1: Information Gain = 0.23072701099064896 PLAAT4: Information Gain = 0.23071144201305827 RAB11FIP5: Information Gain = 0.23070240265800956 NDUFA13: Information Gain = 0.23065125946901088 NEDD9: Information Gain = 0.23061832660186554 NT5DC4: Information Gain = 0.2306160914738553 YWHAZP5: Information Gain = 0.23061237366965925 SOWAHA: Information Gain = 
0.2305840846264151 PNMA6B: Information Gain = 0.2305245174031838 TRAV19: Information Gain = 0.23052345245181738 LKAAEAR1: Information Gain = 0.23050888465384012 ARMT1: Information Gain = 0.23050317901806827 LRRC10B: Information Gain = 0.23047049413026754 EEF1A1P22: Information Gain = 0.23046974239957163 LRAT: Information Gain = 0.23045169847899327 MARCKS: Information Gain = 0.2304424078168501 GCSHP5: Information Gain = 0.23042914897446742 SNORA10: Information Gain = 0.23042526573913125 CBR1: Information Gain = 0.23039629732986255 KRTAP5-1: Information Gain = 0.2303954155330603 MIR6891: Information Gain = 0.23036869714952735 DLGAP3: Information Gain = 0.230363909577177 FGR: Information Gain = 0.23035000672058903 GSTA4: Information Gain = 0.23034444997069325 C3: Information Gain = 0.23027123510516834 SOCS3-DT: Information Gain = 0.23026389160096494 PSPC1-AS2: Information Gain = 0.23024607334417024 ALDH1L1: Information Gain = 0.2302202557874462 DSG2-AS1: Information Gain = 0.23018186494890003 TNFSF4: Information Gain = 0.2301587652081203 WNT3: Information Gain = 0.23014054940928053 ZNF135: Information Gain = 0.23013528527109983 AMD1: Information Gain = 0.2301311492410425 FAM184A: Information Gain = 0.23012551555252836 SEC1P: Information Gain = 0.23009097530256573 NECTIN4: Information Gain = 0.23005318855418033 LINC00160: Information Gain = 0.23004866839005889 CR2: Information Gain = 0.23002674007733503 CD68: Information Gain = 0.23001967522129774 SFTPA2: Information Gain = 0.2299979308351403 SNORA77B: Information Gain = 0.22999419404023946 MAB21L4: Information Gain = 0.2299914859348775 CTAGE15: Information Gain = 0.22993586350473794 PLAC9P1: Information Gain = 0.22992279849585584 SLC8A1-AS1: Information Gain = 0.22989654122860226 ANKRD17-DT: Information Gain = 0.22988082153911749 TRIL: Information Gain = 0.22984279516678674 EGFLAM: Information Gain = 0.22983572870916524 MIR6741: Information Gain = 0.2298330448311312 TUBB1: Information Gain = 0.2298208632217431 KCNK12: 
Ranked gene list, continued (genes ordered by decreasing information gain with respect to the normoxia/hypoxia label). The several hundred entries in this stretch of the ranking occupy a narrow band of scores, from roughly 0.2298 (RUNX2-AS1) down to 0.2202 (ABLIM1), so no individual gene here is strongly discriminative on its own; the most informative genes appear earlier in the ranking. Representative entries, rounded to four decimal places:

RUNX2-AS1 (0.2298), GATA4 (0.2294), SERPINE1 (0.2285), TNFAIP3 (0.2284), MEF2C (0.2281), IFIT2 (0.2279), IGFBP5 (0.2274), DNMT1 (0.2270), SHH (0.2256), FLT1 (0.2255), MMP2 (0.2239), ESR1 (0.2239), EPAS1 (0.2221), MIR21 (0.2219), NEAT1 (0.2218), IL6 (0.2213), CFTR (0.2211), ABLIM1 (0.2202)

[Several hundred further genes with information gain between 0.2202 and 0.2298 follow in the full output; the final entry (JPH1) is truncated.]
0.2202091342265473 MIR3960: Information Gain = 0.220207882712675 OR5M3: Information Gain = 0.22020446094709478 ST8SIA6: Information Gain = 0.22019324599983814 LINC02641: Information Gain = 0.22017905456551334 ARF1P1: Information Gain = 0.22017481392960625 NPM1P24: Information Gain = 0.22017215016111957 MIR6838: Information Gain = 0.22016914050731984 IGHEP1: Information Gain = 0.22016809542242832 CTRB2: Information Gain = 0.22015365599705583 MYLK-AS1: Information Gain = 0.2201420681097086 VPS26BP1: Information Gain = 0.22012047868105133 MYOG: Information Gain = 0.2201010527262195 FBN1: Information Gain = 0.2200941445839022 SRSF3P5: Information Gain = 0.22008894509646537 RAP1AP: Information Gain = 0.22007902258378476 CROCCP4: Information Gain = 0.2200759559193557 SPDYE21: Information Gain = 0.22007428383904482 FOXN1: Information Gain = 0.22006668861071566 ATP5PBP7: Information Gain = 0.2200648921451973 TPI1P4: Information Gain = 0.22005749737741742 ZBTB39: Information Gain = 0.22004285547511748 FAM183A: Information Gain = 0.22003883410462777 ADH4: Information Gain = 0.2200359993920573 PLA2G1B: Information Gain = 0.2200273330095368 ELN: Information Gain = 0.2200273330095368 GNE: Information Gain = 0.22002514024312592 EEF1A1P29: Information Gain = 0.2200138235779603 RPL22P24: Information Gain = 0.22000314348523142 CD207: Information Gain = 0.2200030679623748 MIR146B: Information Gain = 0.22000117995145896 LINC02280: Information Gain = 0.22000117995145896 LINC02055: Information Gain = 0.21999467681015838 PLP1: Information Gain = 0.2199908126493908 MIR4482: Information Gain = 0.21998741720067394 MRPS5P3: Information Gain = 0.21998652355144244 LINC02888: Information Gain = 0.21998398158548227 TRAV29DV5: Information Gain = 0.2199681710041721 CATSPERZ: Information Gain = 0.21996065168358414 HMGA2-AS1: Information Gain = 0.21995895026710333 TINAGL1: Information Gain = 0.21995714187941018 MIR6506: Information Gain = 0.2199531323011754 LCE1B: Information Gain = 
0.21994833968711025 BCAP31P2: Information Gain = 0.21994125864423641 COX5AP2: Information Gain = 0.2199360143267559 MIR1279: Information Gain = 0.219925615107186 CSRP3-AS1: Information Gain = 0.2199183697843652 LINC02012: Information Gain = 0.2199137423105031 MIR6779: Information Gain = 0.2199114473939241 TRBV20OR9-2: Information Gain = 0.21990712137473944 RPL8P2: Information Gain = 0.21990301112888577 OPN3: Information Gain = 0.21987213988064713 HCAR2: Information Gain = 0.21987050217394688 VSIG1: Information Gain = 0.21985388349446788 LDLRAD4-AS1: Information Gain = 0.21984962745208425 TDRP: Information Gain = 0.21984420783516034 LIPE: Information Gain = 0.21984070346878326 MIX23P3: Information Gain = 0.21983972941730645 TSPY26P: Information Gain = 0.21982566224851374 GLULP4: Information Gain = 0.21982523383152497 SCHIP1: Information Gain = 0.21980425115156055 MTMR9LP: Information Gain = 0.21979908399584702 CCNI2: Information Gain = 0.21979560745696203 CLPS: Information Gain = 0.219795038719238 DLGAP5: Information Gain = 0.2197871507926623 TOLLIP-DT: Information Gain = 0.2197850834165027 SMIM6: Information Gain = 0.2197794795971264 EDA: Information Gain = 0.21977647980597115 LINC01686: Information Gain = 0.2197756156692119 ADAMTS7: Information Gain = 0.21977089992082544 SMCO2: Information Gain = 0.21976613148251256 RN7SKP116: Information Gain = 0.21976091228049066 H1-12P: Information Gain = 0.21975871357632437 KLF7P1: Information Gain = 0.21971961394908956 FNTAP1: Information Gain = 0.21971812984950434 MIR3609: Information Gain = 0.21971664574991912 LINC02518: Information Gain = 0.21967551588205358 NAV2-AS3: Information Gain = 0.21966797782150427 RASA3-IT1: Information Gain = 0.2196619020270172 MTX3: Information Gain = 0.2196608614134199 OR8A3P: Information Gain = 0.21965207812895238 MPC1-DT: Information Gain = 0.2196469697121981 ZNF827: Information Gain = 0.21963611150938722 LINC00634: Information Gain = 0.2196246579544603 BMS1P15: Information Gain = 
0.2196211683233089 YWHAZP2: Information Gain = 0.2196110332880763 HAL: Information Gain = 0.219608344198863 RPL3P8: Information Gain = 0.21960492942674015 PRTN3: Information Gain = 0.21960001666211482 PDE10A: Information Gain = 0.21956517063595848 TTLL1-AS1: Information Gain = 0.21955125922544516 UMODL1-AS1: Information Gain = 0.219547848673457 OR10D3: Information Gain = 0.219547848673457 RPS4XP8: Information Gain = 0.21954433334535994 ARHGAP29: Information Gain = 0.21953476817445772 SH2D5: Information Gain = 0.2195030724431366 COPS8P2: Information Gain = 0.21949981045591582 MIR6075: Information Gain = 0.21949431293537747 RPS26P41: Information Gain = 0.2194900187063995 KCNG4: Information Gain = 0.21948654995293104 CEP126: Information Gain = 0.21947755807834657 MGAT4EP: Information Gain = 0.21947503916078115 SLC2A3P4: Information Gain = 0.21946603320333713 MKI67: Information Gain = 0.21946528281250544 TMPRSS7: Information Gain = 0.21946082119991628 RNA5SP283: Information Gain = 0.219456087679613 KCNJ6: Information Gain = 0.21943333766393436 PROKR1: Information Gain = 0.21941528682787026 YPEL5P2: Information Gain = 0.21939220869490983 MSN: Information Gain = 0.2193906542874151 RN7SL431P: Information Gain = 0.2193906542874151 SPEF2: Information Gain = 0.2193877364698562 TGIF1P1: Information Gain = 0.21938660483443506 AKAP12: Information Gain = 0.21937891539318266 GRM6: Information Gain = 0.2193774936421804 SLC6A16: Information Gain = 0.21937462466881663 CHRNE: Information Gain = 0.21936572378661245 RPL18AP15: Information Gain = 0.2193612785432859 GATA6-AS1: Information Gain = 0.21936016573304618 BACH1-IT1: Information Gain = 0.21935426750261056 LINC01441: Information Gain = 0.21933735041446734 CAMK2D: Information Gain = 0.21933703767933332 LINC01134: Information Gain = 0.21933365314320574 SLC5A5: Information Gain = 0.21932781758763942 MAFTRR: Information Gain = 0.2193270998495862 HMGN1P35: Information Gain = 0.2193107983135718 GPR37L1: Information Gain = 
0.2193028839303255 MIR6844: Information Gain = 0.2193011873417241 NELL1: Information Gain = 0.21929780585595915 GJA1: Information Gain = 0.21929780585595915 MRAP-AS1: Information Gain = 0.21929780585595915 MESP2: Information Gain = 0.21929693393484873 ALMS1-IT1: Information Gain = 0.21929503911302017 GRXCR2: Information Gain = 0.2192911844885781 SPIRE1: Information Gain = 0.21928945916033982 GSTP1: Information Gain = 0.21928941894672427 CYP4Z1: Information Gain = 0.21927098652157762 KRT8P43: Information Gain = 0.21926272321480256 SLC52A3: Information Gain = 0.21925499846142937 CBX5P1: Information Gain = 0.2192509704853678 MIR4690: Information Gain = 0.21925042676521245 TSSK3: Information Gain = 0.219241184261594 TXNP4: Information Gain = 0.21924111163652849 FOXD2-AS1: Information Gain = 0.21923856847924794 DAPK1: Information Gain = 0.21922888693836962 C16orf92: Information Gain = 0.21921962898128666 PLCD4: Information Gain = 0.2192169088074578 TCEAL8: Information Gain = 0.21921434629667402 PPIL1: Information Gain = 0.21921052125174256 MANBA: Information Gain = 0.21920962617657525 LINC01747: Information Gain = 0.21917262074557264 DNM3: Information Gain = 0.21916517635286792 PRICKLE2-AS3: Information Gain = 0.2191527986306836 CCDC110: Information Gain = 0.21913983032938478 HOMER2: Information Gain = 0.21911711239292053 NPIPA9: Information Gain = 0.21911442121728752 MIR6790: Information Gain = 0.21911213885067182 TMSB15B-AS1: Information Gain = 0.21910763914173637 IFI6: Information Gain = 0.21910588704502199 ZNF419: Information Gain = 0.21910083392756619 SYT11: Information Gain = 0.21908664219542828 LINC02851: Information Gain = 0.21908629971090865 SNTG1: Information Gain = 0.21908617954090182 HCLS1: Information Gain = 0.21907176607295709 UBASH3A: Information Gain = 0.2190716119662348 OR8G5: Information Gain = 0.21907154179839305 HLA-DQB2: Information Gain = 0.21907035137657793 KCTD5P1: Information Gain = 0.2190605595716808 GSDMD: Information Gain = 
0.21905790344421439 NRN1L: Information Gain = 0.21904979442287442 GAB3: Information Gain = 0.21904968381574164 EIF3IP1: Information Gain = 0.21904923283932698 RNF222: Information Gain = 0.21904708523340322 SLC22A13: Information Gain = 0.21904708523340322 CLRN1-AS1: Information Gain = 0.21904708523340322 GNG10P1: Information Gain = 0.21904708523340322 HSP90AA4P: Information Gain = 0.21904708523340322 CDHR4: Information Gain = 0.21904708523340322 EXTL3-AS1: Information Gain = 0.2190447046177244 PSMC1P8: Information Gain = 0.21904391735650885 MIR5188: Information Gain = 0.21903092130714952 P2RY1: Information Gain = 0.21902744320363343 EIF3LP1: Information Gain = 0.21902242009734452 TMTC2: Information Gain = 0.2190084516323363 KLF3P1: Information Gain = 0.21900313305337482 F7: Information Gain = 0.21899325808305692 SV2B: Information Gain = 0.21899256233175324 OR8T1P: Information Gain = 0.21898137927787364 RNF20: Information Gain = 0.21896515982186848 ANKRD11P2: Information Gain = 0.2189622085299292 DDX59-AS1: Information Gain = 0.21895832495642553 OPN1SW: Information Gain = 0.2189557586157176 LINC01366: Information Gain = 0.21894912509176523 NLRP3P1: Information Gain = 0.21894145879094218 LINC00534: Information Gain = 0.21893950787544347 SEPTIN7P8: Information Gain = 0.21893651981429318 PHBP7: Information Gain = 0.21893285800001427 RNU6-883P: Information Gain = 0.21892721893201106 GAPDHP67: Information Gain = 0.21892590279418256 RRN3P2: Information Gain = 0.2189184160830855 CHI3L1: Information Gain = 0.21889713409452782 OXCT1: Information Gain = 0.21889641293245288 MFAP4: Information Gain = 0.21889040132177984 BET1: Information Gain = 0.2188866253984434 RPS2P2: Information Gain = 0.2188808340078625 HYI-AS1: Information Gain = 0.21887865080526514 IDH1-AS1: Information Gain = 0.21887509085839008 PINCR: Information Gain = 0.2188742479929644 PAQR8: Information Gain = 0.21887259534247505 ZNF460-AS1: Information Gain = 0.21883663258049846 MIRLET7F1: Information Gain = 
0.2188290299632898 PSMC1P11: Information Gain = 0.2188280046207478 H2BC18: Information Gain = 0.2188096058930673 ALDH1A3-AS1: Information Gain = 0.21880736814590906 GAPDHP48: Information Gain = 0.21880694049583993 ZNF649: Information Gain = 0.21880694049583993 PHF2P2: Information Gain = 0.21880300075640458 PPARGC1A: Information Gain = 0.21879140751251058 ANP32BP1: Information Gain = 0.218766766933584 ADAMTS2: Information Gain = 0.2187602544481544 RNU6-418P: Information Gain = 0.2187601439033635 MAP3K2-DT: Information Gain = 0.2187580077336324 AATBC: Information Gain = 0.21874842324766464 RNA5SP439: Information Gain = 0.21874393797237723 HMGN2P38: Information Gain = 0.21873991603138698 FAM3D: Information Gain = 0.21873622345102217 RTCA-AS1: Information Gain = 0.21873485306064455 HIC2: Information Gain = 0.21872301921703152 UGT1A12P: Information Gain = 0.21872075456688234 FHAD1: Information Gain = 0.21871522997142812 PCOLCE2: Information Gain = 0.2187127878725148 LINC00858: Information Gain = 0.21870465510381876 HS3ST6: Information Gain = 0.21869951451816072 MAPK8IP2: Information Gain = 0.21869565168378702 TAPT1-AS1: Information Gain = 0.21869534641190347 SLC1A6: Information Gain = 0.2186946869847768 LINC00664: Information Gain = 0.21869217855047607 RPL21P41: Information Gain = 0.218687419406965 INPP5J: Information Gain = 0.21868212550470467 SCARNA3: Information Gain = 0.21868028431280462 MTND4LP30: Information Gain = 0.21868028431280462 HLA-DRB9: Information Gain = 0.21867873166085072 STX7: Information Gain = 0.2186646990998251 PRB3: Information Gain = 0.21865564876809906 VDAC1P7: Information Gain = 0.2186459065574473 TONSL-AS1: Information Gain = 0.21864207477623432 TLR6: Information Gain = 0.21863522233859056 SF3A3P1: Information Gain = 0.21863403584962438 SHOX2: Information Gain = 0.21862163415733993 MIR637: Information Gain = 0.21860478225315694 LINC01397: Information Gain = 0.21859768846280359 OR8B2: Information Gain = 0.21859760424973795 RN7SL743P: Information 
Gain = 0.2185961699027421 MIR193B: Information Gain = 0.2185924370335015 HAUS6P1: Information Gain = 0.2185844986537926 PTGS1: Information Gain = 0.2185815519207872 ZNF320: Information Gain = 0.21857343432859766 LINC00266-1: Information Gain = 0.21857055620549604 MRPS31P2: Information Gain = 0.2185674709357257 SF3A3P2: Information Gain = 0.21856459722484112 LEFTY1: Information Gain = 0.21855053976059158 SYNPR-AS1: Information Gain = 0.21854939025259967 RN7SL164P: Information Gain = 0.21854751957406937 ALOX12B: Information Gain = 0.2185437183844534 MIR421: Information Gain = 0.21854070194658992 MT-TV: Information Gain = 0.21853550613251738 HERC2P3: Information Gain = 0.21853280843229528 CNN2P12: Information Gain = 0.21853193143502136 DNAI3: Information Gain = 0.21853050472551327 IMPDH1P2: Information Gain = 0.21852787431071108 MIR4523: Information Gain = 0.21851635166987804 MIR4675: Information Gain = 0.21851474689915507 SNORD34: Information Gain = 0.21849411075184388 RPS23P1: Information Gain = 0.21848786705704026 HENMT1: Information Gain = 0.21848461833662958 GNRH1: Information Gain = 0.21846534545465257 C5AR2: Information Gain = 0.21846079127856144 ARX: Information Gain = 0.21845725973983932 LUADT1: Information Gain = 0.2184523070646942 RPS5P2: Information Gain = 0.21845073406347582 SLCO1A2: Information Gain = 0.21843826909010544 GDAP1L1: Information Gain = 0.21842502246462492 NADK2-AS1: Information Gain = 0.21842004715008567 SLC6A19: Information Gain = 0.2184132595687691 HBQ1: Information Gain = 0.21841267403160147 LRP1: Information Gain = 0.21841096369940338 HMGN2P10: Information Gain = 0.21840909365395555 PLAC1: Information Gain = 0.2184012608821373 ANKRD49P1: Information Gain = 0.2183974143767382 RPL36AP45: Information Gain = 0.21839248643525333 MIR6872: Information Gain = 0.21838972705956383 MAGEE1: Information Gain = 0.21838178270892317 CCDC200: Information Gain = 0.2183748526474505 CBX3P1: Information Gain = 0.21837426838595264 CALCB: Information Gain = 
0.21836616291898636 LINP1: Information Gain = 0.21836282617181668 RPL32P16: Information Gain = 0.21835615511947637 PRL: Information Gain = 0.21835252990795495 PBX1-AS1: Information Gain = 0.21833044125730905 MTHFD2P7: Information Gain = 0.2183147582299656 FENDRR: Information Gain = 0.21831319798843984 FOXD3-AS1: Information Gain = 0.21830468054501773 RPL22P1: Information Gain = 0.21830459222605247 MIR193BHG: Information Gain = 0.21829333309872712 FNDC3CP: Information Gain = 0.21828682668042276 RNF213-AS1: Information Gain = 0.21827409992522973 ARHGEF18-AS1: Information Gain = 0.21827287662526262 ZNF221: Information Gain = 0.21827231506189393 EVX1: Information Gain = 0.21827164320004955 ROBO3: Information Gain = 0.21827131225526886 SNORA50A: Information Gain = 0.2182622739246205 RBMS1P1: Information Gain = 0.21826216030403556 GOLGA8H: Information Gain = 0.21826067768691892 MIR6836: Information Gain = 0.2182530447279316 LINC02895: Information Gain = 0.2182396768678867 GPR55: Information Gain = 0.21823519105439537 KRTAP1-3: Information Gain = 0.21822802749604064 TNNC2: Information Gain = 0.21822748024524752 APOB: Information Gain = 0.2182262672816926 PCNPP3: Information Gain = 0.21820850046769547 AFTPH-DT: Information Gain = 0.2182074089441126 ATP5F1EP2: Information Gain = 0.21820463867779827 EEF1A1P2: Information Gain = 0.21820187443796302 F8A3: Information Gain = 0.21819016413241865 HCG27: Information Gain = 0.2181786222225255 LINC02816: Information Gain = 0.218175315897281 VN1R83P: Information Gain = 0.2181508311070366 BHLHE41: Information Gain = 0.21814848844719914 APLF: Information Gain = 0.21814490759195926 SERPINA4: Information Gain = 0.21814382095197415 MMP21: Information Gain = 0.21814366815118835 MACROD2-IT1: Information Gain = 0.21814014814475824 TMEM132E: Information Gain = 0.21813339361559736 LBX1-AS1: Information Gain = 0.21813339361559736 BNC2-AS1: Information Gain = 0.21813339361559736 OXGR1: Information Gain = 0.21813339361559736 HTR5A: Information 
Gain = 0.21813339361559736 RNU6-460P: Information Gain = 0.21813339361559736 GTF2IRD2P1: Information Gain = 0.21813179945935568 CHST9: Information Gain = 0.21812952807000974 ZBBX: Information Gain = 0.2181250912745092 LINC02019: Information Gain = 0.21812410613795574 NPR3: Information Gain = 0.2181183761661456 LINC01311: Information Gain = 0.21811385143609296 PRSS29P: Information Gain = 0.21811179723687735 KRT8P4: Information Gain = 0.21809924248582013 DSC1: Information Gain = 0.21809819276053077 KAT7P1: Information Gain = 0.21809468035438706 RNVU1-2A: Information Gain = 0.2180803787682788 ANO7L1: Information Gain = 0.21806728999701308 RPS26P15: Information Gain = 0.21806094332365067 PRKN: Information Gain = 0.2180590403201117 INSC: Information Gain = 0.21805576379113223 HPCAL4: Information Gain = 0.218055763791132 CAHM: Information Gain = 0.21804945710717716 SLC12A4: Information Gain = 0.21804865565576725 COX6CP2: Information Gain = 0.21804092279527842 ZDHHC1: Information Gain = 0.21803640231081922 MBLAC1: Information Gain = 0.21802641993426253 CORO1A: Information Gain = 0.2180192096953517 MYL12BP2: Information Gain = 0.21801598616762097 CASS4: Information Gain = 0.21800941593882062 MTND4LP7: Information Gain = 0.21800450527468374 RN7SL89P: Information Gain = 0.2180030211750985 LINC00997: Information Gain = 0.21799841305767198 ZNF517: Information Gain = 0.21799416716280628 LRIG2: Information Gain = 0.21799385064446364 EPB41L4A-AS1: Information Gain = 0.21799380521396983 GUCY1B1: Information Gain = 0.21799098644710257 ACTR1AP1: Information Gain = 0.21798651838105076 PRRT4: Information Gain = 0.21797841583857447 LINC02443: Information Gain = 0.2179675260457885 ACTBP12: Information Gain = 0.2179627584178998 ANAPC1P2: Information Gain = 0.21795422077732418 PDE4DIPP7: Information Gain = 0.21794730435139353 NACA2: Information Gain = 0.2179458533101759 PRIM1: Information Gain = 0.21794423342980562 H2AZP1: Information Gain = 0.21793935033906453 ARHGAP26: Information Gain 
= 0.21793921789133708 TMEM145: Information Gain = 0.2179351426738667 KCNQ4: Information Gain = 0.21793071037078215 CCDC181: Information Gain = 0.21792565053926416 RPSAP6: Information Gain = 0.21791511681504216 RNA5SP437: Information Gain = 0.21791474534799105 MIR2110: Information Gain = 0.21791131283489218 RNFT1P3: Information Gain = 0.21790896797820736 SLC4A1: Information Gain = 0.21790860960916758 SNORD36B: Information Gain = 0.21790657770287458 MTND5P1: Information Gain = 0.21790217441974624 ADAM11: Information Gain = 0.217900822548714 EDIL3-DT: Information Gain = 0.21789339024035925 ANKRD18B: Information Gain = 0.217890608886822 TMPRSS11A: Information Gain = 0.21788520049182125 SMAD5: Information Gain = 0.21788326958240667 ZCCHC18: Information Gain = 0.21787973113883385 MBTPS1-DT: Information Gain = 0.2178758990730918 NRSN2-AS1: Information Gain = 0.21786473058161726 ZSCAN5C: Information Gain = 0.2178629868492914 DEFB1: Information Gain = 0.2178507108258405 DIAPH2-AS1: Information Gain = 0.21785050254334704 HOXB6: Information Gain = 0.21782942864708477 MIR4284: Information Gain = 0.21781454052775717 CFAP69: Information Gain = 0.21780666739021148 HNRNPA1P46: Information Gain = 0.21780654922229736 CCDC152: Information Gain = 0.21780554887691084 IL21R: Information Gain = 0.21780212465508142 IL21R-AS1: Information Gain = 0.21780212465508142 ANKRD20A19P: Information Gain = 0.21778157708434964 GRIA3: Information Gain = 0.21777715591485958 CCNJP2: Information Gain = 0.21777570694189174 CORO2B: Information Gain = 0.2177738286431925 MIR181B2: Information Gain = 0.21776179791172323 NOS2: Information Gain = 0.21776179791172323 THSD8: Information Gain = 0.21775805486931077 PTCH2: Information Gain = 0.21775187602888324 NIFKP4: Information Gain = 0.2177327470888053 NCMAP: Information Gain = 0.2177326643992088 ACTBP7: Information Gain = 0.2177245394008005 NME5: Information Gain = 0.21772034268951068 RNU6-1285P: Information Gain = 0.21771459540926097 TTC4P1: Information Gain = 
0.2177111520455579 PMS2P11: Information Gain = 0.21770687355614093 FAM43B: Information Gain = 0.21770504074346153 GVINP1: Information Gain = 0.21769524175339505 MEF2C-AS1: Information Gain = 0.21769524175339505 MEGF10: Information Gain = 0.21769524175339505 FAM166C: Information Gain = 0.21769524175339505 PTCHD3P2: Information Gain = 0.21769524175339505 TRABD2B: Information Gain = 0.21769524175339505 KCNMB2: Information Gain = 0.21769524175339505 IGF1: Information Gain = 0.21769524175339505 RPL7P58: Information Gain = 0.21769524175339505 ROCR: Information Gain = 0.21768028659603478 VGLL1: Information Gain = 0.21767143468844852 ACTP1: Information Gain = 0.2176623004854099 BMP8A: Information Gain = 0.21765203522140641 ASTN2: Information Gain = 0.2176440629675691 LRFN2: Information Gain = 0.21763947325552468 CNTNAP3C: Information Gain = 0.21763602644129487 BCAS2P1: Information Gain = 0.2176331678422634 CICP13: Information Gain = 0.21762148741165266 LINC02463: Information Gain = 0.21762148741165266 ZNF658: Information Gain = 0.21761202126032564 TXNDC8: Information Gain = 0.21761104491430472 ABHD14A-ACY1: Information Gain = 0.21760553997468168 CDH17: Information Gain = 0.21760089054872056 DYNAP: Information Gain = 0.21759608975694467 LONRF3: Information Gain = 0.2175937375929331 LINC01091: Information Gain = 0.2175742816286872 PNPLA1: Information Gain = 0.21756303672029054 GCATP1: Information Gain = 0.2175603578098475 GNMT: Information Gain = 0.21755900046041998 SEC61G: Information Gain = 0.21755153594827803 SBK2: Information Gain = 0.21755134294672707 AOC2: Information Gain = 0.21755004037919967 TMEM169: Information Gain = 0.21753378052029637 ELAVL2: Information Gain = 0.21752450062181028 RTKN: Information Gain = 0.21749057805583671 CHID1: Information Gain = 0.2174877451201851 SLC4A1APP1: Information Gain = 0.21748081457130408 PICART1: Information Gain = 0.21746697910151802 PDC-AS1: Information Gain = 0.2174657670073521 CLDN14: Information Gain = 0.21746009395252686 
SNORA63D: Information Gain = 0.21745070780632925 FBLN2: Information Gain = 0.21744665946254238 RPL23AP12: Information Gain = 0.2174426627006194 PDCL3P2: Information Gain = 0.21744024432997122 PTTG2: Information Gain = 0.21742561336448984 ADORA3: Information Gain = 0.21741106741858762 ARHGAP31: Information Gain = 0.21740721738668678 RNY3P15: Information Gain = 0.21740329119165258 DYNLT3P2: Information Gain = 0.2174017967647468 LIG1: Information Gain = 0.21739685362061656 ZFPM2-AS1: Information Gain = 0.2173953082072706 SELENOP: Information Gain = 0.2173840750436038 FBLN7: Information Gain = 0.21738316176911754 P2RX5: Information Gain = 0.2173460991662406 SPRY4: Information Gain = 0.2173388090199635 MIR6859-1: Information Gain = 0.21733845863488988 CSTA: Information Gain = 0.2173381692434344 JMY: Information Gain = 0.217317268231064 HCAR3: Information Gain = 0.2173170657126846 CGB3: Information Gain = 0.21731628475157816 KRT18P6: Information Gain = 0.21731542471894572 USP51: Information Gain = 0.21730187145994884 WASIR1: Information Gain = 0.2172956478638528 ACER2P1: Information Gain = 0.21728569062909053 MIR365A: Information Gain = 0.21728569062909053 CSMD2: Information Gain = 0.21728569062909053 ENPP7P7: Information Gain = 0.2172805720349278 RNU4-78P: Information Gain = 0.21727394192187055 CHST1: Information Gain = 0.21727211130840485 LINC00648: Information Gain = 0.2172658803271983 LINC01361: Information Gain = 0.21724816890456844 IQCN: Information Gain = 0.21724564707048488 MIR7851: Information Gain = 0.2172411817181541 C1QTNF1: Information Gain = 0.2172312127464111 SPATA45: Information Gain = 0.21722902238769626 PLCL2: Information Gain = 0.21721768031961752 FAM114A1: Information Gain = 0.2172102006624259 GATA1: Information Gain = 0.21720977322722357 CTBP2P8: Information Gain = 0.21719631765919267 ATP13A4: Information Gain = 0.2171879146570006 RPS17P5: Information Gain = 0.21718636244704737 PPP1R2: Information Gain = 0.21717598394837223 FYB1: Information Gain = 
0.21717443696043826 RBMXP3: Information Gain = 0.2171690274613578 RNU6-481P: Information Gain = 0.217161188217176 C16orf96: Information Gain = 0.2171549757040716 CALM2P3: Information Gain = 0.21715096915861287 NEXN: Information Gain = 0.21714475496873087 ZXDA: Information Gain = 0.21714397383147133 TPRKBP2: Information Gain = 0.21714032767874847 DHX58: Information Gain = 0.21713986225106785 IL1A: Information Gain = 0.21713825445401058 C20orf144: Information Gain = 0.21713279815942932 C19orf71: Information Gain = 0.2171307984984321 MIR1234: Information Gain = 0.217128658196305 SLC38A3: Information Gain = 0.21712443341670062 LINC02904: Information Gain = 0.21712273001158078 PPIAP31: Information Gain = 0.217117795170638 RPL21P135: Information Gain = 0.2171126157766352 SASH1: Information Gain = 0.21710783857861604 U2AF1L5: Information Gain = 0.21710530777738501 NPAS2-AS1: Information Gain = 0.2170981403422616 RSPO1: Information Gain = 0.217095958518972 POU3F2: Information Gain = 0.2170921351338655 C8orf74: Information Gain = 0.21708881583051332 FRMPD1: Information Gain = 0.21708451944124074 LINC00942: Information Gain = 0.21708207429312543 KRT18P40: Information Gain = 0.21708192684648941 MIR600: Information Gain = 0.21707962143840742 DSEL: Information Gain = 0.21707107941994574 RMDN2-AS1: Information Gain = 0.2170698167966847 RNU6-455P: Information Gain = 0.21706655433673294 AGGF1P1: Information Gain = 0.21706107083108717 GAPDHP24: Information Gain = 0.21705570298510857 MT1L: Information Gain = 0.21705462773979955 LINC01907: Information Gain = 0.21705268279858636 CD4: Information Gain = 0.21704589468787883 PZP: Information Gain = 0.21704406451993918 SMPD4P1: Information Gain = 0.21703905476722762 EPCAM-DT: Information Gain = 0.21703556477388197 UBE2Q2L: Information Gain = 0.21699700471936012 NCF2: Information Gain = 0.21699500767689583 PAX7: Information Gain = 0.2169941699330693 IPO8P1: Information Gain = 0.21699229555721433 CCDC160: Information Gain = 
0.21698744203768716 AKR1B1: Information Gain = 0.21698601667435335 KCNH6: Information Gain = 0.21696746214903317 … CCDC69: Information Gain = 0.214989132488141
(per-gene information-gain listing truncated: the output continues through several thousand genes with scores between ≈0.2170 and ≈0.2150)
# Keep only genes whose information gain exceeds the 0.215 cutoff
filtered_genes1 = [gene for gene in sorted_genes if information_gain[gene] > 0.215]
r = len(filtered_genes1)
print(r)
3414
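For context, scores like the `information_gain` values above can be produced with scikit-learn's `mutual_info_classif` (information gain and mutual information coincide for a class label target). A minimal sketch, assuming a cells × genes expression matrix with binary normoxia/hypoxia labels; the toy matrix, label vector, and gene names here are illustrative placeholders, not the experiment's real data:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Toy cells-x-genes matrix (values and gene names are illustrative only)
data = pd.DataFrame(
    [[5.0, 0.1, 2.0], [4.8, 0.2, 1.9], [5.1, 0.0, 2.1], [4.9, 0.1, 2.0],
     [0.2, 3.9, 2.0], [0.1, 4.2, 2.2], [0.3, 4.0, 1.8], [0.2, 4.1, 2.1]],
    columns=["VEGFA", "NDRG1", "GAPDH"],
)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = normoxia, 1 = hypoxia

# Mutual information between each gene's expression and the condition label;
# sklearn clips the kNN estimates at zero, so scores are non-negative
scores = mutual_info_classif(data.values, labels, random_state=0)
information_gain = dict(zip(data.columns, scores))
sorted_genes = sorted(information_gain, key=information_gain.get, reverse=True)
print(sorted_genes)
```

With real data, `data` would hold the normalized Smart-Seq/Drop-Seq counts and `labels` the oxygen condition of each cell.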
def plot_info(len_data, data):
    plt.figure(figsize=(10, 6))
    # Bar heights are the information-gain scores of the selected genes
    plt.bar(range(len_data), [information_gain[gene] for gene in data])
    plt.xlabel('Genes')
    plt.ylabel('Information Gain')
    plt.title('Information Gain for Selected Genes')
    plt.tight_layout()
    plt.grid(False)
    plt.show()

plot_info(r, filtered_genes1)
This is the list of selected genes:
for gene in filtered_genes1:
    print(gene, end=' ')
NDRG1 BNIP3 HK2 P4HA1 GAPDHP1 BNIP3L MT-CYB MT-CO3 FAM162A LDHAP4 ENO2 HILPDA ERO1A PDK1 PGK1 VEGFA C4orf3 LDHA KDM3A DSP PFKP PFKFB3 DDIT4 PFKFB4 GAPDHP65 CYP1B1 GPI MTATP6P1 CYP1B1-AS1 AK4 IRF2BP2 BNIP3P1 MT-ATP8 MXI1 MT-ATP6 TLE1 FUT11 RIMKLA UBC IFITM2 CIART TES HK2P1 HIF1A-AS3 GBE1 MYO1B GAPDH P4HA2 SLC2A1 PGK1P1 …
(gene list truncated; 3,414 genes in total, led by well-known hypoxia-response genes such as NDRG1, BNIP3, PDK1, PGK1, VEGFA, LDHA, CA9, and SLC2A1)
MIR6790 TMSB15B-AS1 IFI6 ZNF419 SYT11 LINC02851 SNTG1 HCLS1 UBASH3A OR8G5 HLA-DQB2 KCTD5P1 GSDMD NRN1L GAB3 EIF3IP1 RNF222 SLC22A13 CLRN1-AS1 GNG10P1 HSP90AA4P CDHR4 EXTL3-AS1 PSMC1P8 MIR5188 P2RY1 EIF3LP1 TMTC2 KLF3P1 F7 SV2B OR8T1P RNF20 ANKRD11P2 DDX59-AS1 OPN1SW LINC01366 NLRP3P1 LINC00534 SEPTIN7P8 PHBP7 RNU6-883P GAPDHP67 RRN3P2 CHI3L1 OXCT1 MFAP4 BET1 RPS2P2 HYI-AS1 IDH1-AS1 PINCR PAQR8 ZNF460-AS1 MIRLET7F1 PSMC1P11 H2BC18 ALDH1A3-AS1 GAPDHP48 ZNF649 PHF2P2 PPARGC1A ANP32BP1 ADAMTS2 RNU6-418P MAP3K2-DT AATBC RNA5SP439 HMGN2P38 FAM3D RTCA-AS1 HIC2 UGT1A12P FHAD1 PCOLCE2 LINC00858 HS3ST6 MAPK8IP2 TAPT1-AS1 SLC1A6 LINC00664 RPL21P41 INPP5J SCARNA3 MTND4LP30 HLA-DRB9 STX7 PRB3 VDAC1P7 TONSL-AS1 TLR6 SF3A3P1 SHOX2 MIR637 LINC01397 OR8B2 RN7SL743P MIR193B HAUS6P1 PTGS1 ZNF320 LINC00266-1 MRPS31P2 SF3A3P2 LEFTY1 SYNPR-AS1 RN7SL164P ALOX12B MIR421 MT-TV HERC2P3 CNN2P12 DNAI3 IMPDH1P2 MIR4523 MIR4675 SNORD34 RPS23P1 HENMT1 GNRH1 C5AR2 ARX LUADT1 RPS5P2 SLCO1A2 GDAP1L1 NADK2-AS1 SLC6A19 HBQ1 LRP1 HMGN2P10 PLAC1 ANKRD49P1 RPL36AP45 MIR6872 MAGEE1 CCDC200 CBX3P1 CALCB LINP1 RPL32P16 PRL PBX1-AS1 MTHFD2P7 FENDRR FOXD3-AS1 RPL22P1 MIR193BHG FNDC3CP RNF213-AS1 ARHGEF18-AS1 ZNF221 EVX1 ROBO3 SNORA50A RBMS1P1 GOLGA8H MIR6836 LINC02895 GPR55 KRTAP1-3 TNNC2 APOB PCNPP3 AFTPH-DT ATP5F1EP2 EEF1A1P2 F8A3 HCG27 LINC02816 VN1R83P BHLHE41 APLF SERPINA4 MMP21 MACROD2-IT1 TMEM132E LBX1-AS1 BNC2-AS1 OXGR1 HTR5A RNU6-460P GTF2IRD2P1 CHST9 ZBBX LINC02019 NPR3 LINC01311 PRSS29P KRT8P4 DSC1 KAT7P1 RNVU1-2A ANO7L1 RPS26P15 PRKN INSC HPCAL4 CAHM SLC12A4 COX6CP2 ZDHHC1 MBLAC1 CORO1A MYL12BP2 CASS4 MTND4LP7 RN7SL89P LINC00997 ZNF517 LRIG2 EPB41L4A-AS1 GUCY1B1 ACTR1AP1 PRRT4 LINC02443 ACTBP12 ANAPC1P2 PDE4DIPP7 NACA2 PRIM1 H2AZP1 ARHGAP26 TMEM145 KCNQ4 CCDC181 RPSAP6 RNA5SP437 MIR2110 RNFT1P3 SLC4A1 SNORD36B MTND5P1 ADAM11 EDIL3-DT ANKRD18B TMPRSS11A SMAD5 ZCCHC18 MBTPS1-DT NRSN2-AS1 ZSCAN5C DEFB1 DIAPH2-AS1 HOXB6 MIR4284 CFAP69 HNRNPA1P46 CCDC152 IL21R IL21R-AS1 ANKRD20A19P GRIA3 CCNJP2 CORO2B 
MIR181B2 NOS2 THSD8 PTCH2 NIFKP4 NCMAP ACTBP7 NME5 RNU6-1285P TTC4P1 PMS2P11 FAM43B GVINP1 MEF2C-AS1 MEGF10 FAM166C PTCHD3P2 TRABD2B KCNMB2 IGF1 RPL7P58 ROCR VGLL1 ACTP1 BMP8A ASTN2 LRFN2 CNTNAP3C BCAS2P1 CICP13 LINC02463 ZNF658 TXNDC8 ABHD14A-ACY1 CDH17 DYNAP LONRF3 LINC01091 PNPLA1 GCATP1 GNMT SEC61G SBK2 AOC2 TMEM169 ELAVL2 RTKN CHID1 SLC4A1APP1 PICART1 PDC-AS1 CLDN14 SNORA63D FBLN2 RPL23AP12 PDCL3P2 PTTG2 ADORA3 ARHGAP31 RNY3P15 DYNLT3P2 LIG1 ZFPM2-AS1 SELENOP FBLN7 P2RX5 SPRY4 MIR6859-1 CSTA JMY HCAR3 CGB3 KRT18P6 USP51 WASIR1 ACER2P1 MIR365A CSMD2 ENPP7P7 RNU4-78P CHST1 LINC00648 LINC01361 IQCN MIR7851 C1QTNF1 SPATA45 PLCL2 FAM114A1 GATA1 CTBP2P8 ATP13A4 RPS17P5 PPP1R2 FYB1 RBMXP3 RNU6-481P C16orf96 CALM2P3 NEXN ZXDA TPRKBP2 DHX58 IL1A C20orf144 C19orf71 MIR1234 SLC38A3 LINC02904 PPIAP31 RPL21P135 SASH1 U2AF1L5 NPAS2-AS1 RSPO1 POU3F2 C8orf74 FRMPD1 LINC00942 KRT18P40 MIR600 DSEL RMDN2-AS1 RNU6-455P AGGF1P1 GAPDHP24 MT1L LINC01907 CD4 PZP SMPD4P1 EPCAM-DT UBE2Q2L NCF2 PAX7 IPO8P1 CCDC160 AKR1B1 KCNH6 RPS4XP19 RPL22P16 LINC02615 BOD1L1 DUTP7 RPS29P7 INSL6 AQP7 MIR3189 EVPLL SLC19A3 RPS3AP29 LEF1 RPS17P1 TRAV27 MSLN TRIM34 ICMT HAS2 SNORD38A TNKS LINC02694 STX8P1 ST6GALNAC4 NME2P2 ARPP21 GRASLND PAX2 RFTN1 VSTM2A CTRB1 SCARNA1 PIH1D2 FAM13C PLPPR3 PRDX3P2 TMEM190 HMCN2 RNU6-1280P KRTDAP SNORA79B PSMD7P1 PRKY APOOP2 CCL26 YBX1P10 PTAFR ZNF441 FAM87B TUBAP4 S100A3 GNG8 TAS2R13 SERPINA9 PPIAP85 ZBTB46 RPL31P63 LYPLA2P1 BLZF2P EXOC3L2 SLC2A7 GASAL1 CENPF NKX2-1 C9orf57 OR6K4P PDGFRB CTSLP2 FOXQ1 SERHL2 CATSPER1 KLF2P1 PHF3 TG CCL4L2 CNTNAP3B LINC00955 MIR1825 GAPDHP23 RPL10AP2 RBMX2P3 C1QTNF3 PNPO NFYCP2 PPIAP40 MUC4 XKR7 KCNQ2 KIAA1210 RPL32P6 TMEM266 GALNT15 RPS15AP6 ZNF532 MIR4720 RPL21P93 SHISAL2A KRT18P56 SPSB3 JAM2 SUMO2P1 FOXP1-AS1 INCA1 C20orf27 NAT8B SARM1 ST3GAL1-DT SEC14L5 MAGEC3 SHLD2P3 HMGN1P8 COL4A2 LINC00460 MIR3139 MYO1G LINC02595 C1QL1 MIR155 MYBPC1 CDCP1 SFTPA1 ABHD12B MYO7A RPL13AP2 POLG-DT KLK4 SPINK5 SLC9A9 DIS3L-AS1 C5orf46 RPL19P20 CNTN2 
TSPOAP1 LINC01338 TRPM2 LINC00167 FBXL19 LINC00840 NBEAP1 KCNT1 GUCA1A GPHA2 SRMP2 NMD3P1 KIAA1217 CYP2T3P AJAP1 APOBEC3B SPAG16 BEAN1 OR7E22P CYP3A7 CYP3A7-CYP3A51P ZDHHC22 LINC02335 SLN ITGA6 ENTPD8 FOXA3 OR52K3P KRTAP9-12P RPL36P2 RPS3AP26 TPBGL SIRT4 LRRC4C LINC01238 C22orf23 TPI1P2 LINC01186 RN7SL354P CARNMT1-AS1 NMRK2 RCC2P6 ZNF571-AS1 SEPHS1P6 AP1M2P1 CDC42-IT1 UFM1P2 SCN3B PKNOX2 APOBEC3G IRAK2 GALNT16 AGO4 POTEG LINC00626 WFDC3 MYOM1 CBX3P2 ZWINT EEF1A1P1 OR10AC1 LIPM RPL37P2 YPEL4 TCAF2C PIGHP1 TBCAP1 MT-TG C1GALT1C1L BEX1 C1QL4 DUSP5-DT KRT15 CMPK2 ADRA2B CXCL8 COP1P1 SMYD3-AS1 ODF3 VSTM4 BTF3L4P1 ARMC3 SEMA7A MIR1972-1 RNU2-27P PRKCQ RPL32P27 RNA5SP141 HLA-DMB MIR3621 ITPRIP-AS1 P3H4 NCR3 LINC01228 LINC00494 ESYT3 EEF1A1P11 PTGIS RSL24D1P1 CHMP5P1 EGR2 PTPRC LINC01114 HOXD8 RNY1P15 KIAA0408 TFGP1 PPP4R1-AS1 ACTG1P3 LINC01933 CCL3 TUBBP2 FRMD5 SGCD ARPP19P1 MIR6740 PEG10 HMGB1P3 RPSAP69 RSL24D1P6 SUMO2P6 MIR5006 TNIP1 SNHG28 RNA5SP37 RBM11 PRKAG2-AS1 RN7SL775P IL11RA LINC01305 ATP6V0E1P3 RN7SL4P CRBN MON1A CCR2 SLC6A20 LINC02533 LINC01362 COL7A1 SNORD3B-1 DEPDC1P1 RASAL2-AS1 SNORD54 ACSM4 OR7E90P H3P47 SETP22 VEGFD GPBAR1 RN7SL466P ABCB10 SCML2P1 ATP6V0E1P2 C1orf94 GCM2 SDR9C7 MAS1 FNDC7 NACAD IFFO1 SPANXB1 PTMAP1 LINC02300 SRCIN1 OGFRP1 TMEM121B CATSPER3 LINC01978 RPS8P4 EVI2B HES7 ZFP37 ALDH3B1 MIR544B RPL7P9 KLHL38 RNU1-134P RN7SL443P G0S2 SLC7A9 PCSK1 DIRAS3 MIR23A FAM157A UPK3A SLC9A7P1 RHEX FLNC SNORA20 KRT8P27 UQCRBP2 DNAJC28 WWP1P1 SNORD52 CLLU1 MIR4513 DDX12P HSPA2-AS1 CCND2-AS1 CCND2 RPL26P30 TNFAIP8 RGMA ARHGAP44-AS1 MIR548O MIR933 MIR6165 ENPP2 RNU7-40P LINC02679 BRWD1-AS1 MIR34A NOTO SNORD70B SEPTIN7P7 MYBL2 LRIG2-DT RPP25 MIR30B ZNF826P RDM1P1 MIR6810 POLH-AS1 FZD1 RPL12P47 RPS7P14 RNU6-29P C1GALT1 BZW1P2 RPL13AP7 PRAM1 EIF2S2P4 RBPMS2 SOX10 LINC00640 FAM133FP FAM217A LINC01068 LINC01864 MTATP8P2 ITGB1 HLA-DRB1 HSPA8P16 KLHDC7B-DT ST18 LINC02223 COX6B1P4 HNRNPA1P47 NT5M OR7E37P MIS18A-AS1 LINC02269 SLC4A9 ADCY5 MYCNUT IL17REL IGHV4-34 
MAD2L1-DT H3P11 RPL31P7 NLRP3 IGSF22 HMGA1P7 KRT85 KCNC2 SLC25A27 LST1 CICP9 TNFAIP6 FGG LYG2 FABP6-AS1 NOG RP9 CLDN11 ANGPTL2 CSF3R LINC01749 PRKAR2B-AS1 LINC00608 VAX1 RPL23AP35 CALCA DBIL5P2 LYPLA1P3 MEAF6P1 ZMYND10 SLC8A3 DLG5-AS1 PDE1A TRIM67 MEDAG ITPRID1 YY2 RN7SL166P UBE2S TBPL2 CENPK TMCO2 MMP10 KCTD9P1 WDHD1 SNORA73B MEFV PSMD8P1 YIPF7 MINAR2 ABCC6P2 ISOC2 TXNP5 PLAT JAG1 LINC01185 TTYH2 CGB7 LINC02068 LINC01701 CALHM3 RPL37A-DT ME3 CNTNAP3P1 ITGA6-AS1 PIGM RPL7AP11 SERHL LINC02052 NIFKP8 ACTN3 C20orf202 MAPK4 UROC1 OLFML2A RN7SL253P NFYBP1 HHIP-AS1 DKKL1 LINC00865
In this section we want to explore the dataset using unsupervised learning techniques.
We will use the Train dataset and carry out our analysis focusing on MCF7 - Smart-Seq, adding remarks and considerations on HCC1806 where relevant.
Applying a logarithmic transformation to the data is helpful for visualization, since raw expression counts are heavily right-skewed.
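As a sketch of what this transformation does (using a hypothetical toy count matrix in place of the real data), `log1p` keeps zeros at zero while compressing large counts:

```python
import numpy as np
import pandas as pd

# Toy count matrix (cells x genes); a stand-in for the real expression data.
counts = pd.DataFrame({"geneA": [0, 10, 1000], "geneB": [5, 0, 50]})

# log1p = log(1 + x): zero counts stay at zero, and the heavy right tail
# of expression values is compressed, which makes plots far more readable.
log_counts = np.log1p(counts)
```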
Reducing the dimensionality of the dataset is useful for several reasons: it removes redundant and noisy features, it makes visualization possible, and it speeds up the downstream analysis.
We start by performing PCA on the cells' features: we reduce the number of genes (dimensions) while retaining the components that together explain 95% of the total variance. PCA is performed on the original dataset, with no transformation applied.
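Passing a float to `n_components` tells scikit-learn to keep the smallest number of components whose cumulative explained variance reaches that fraction; a minimal sketch on synthetic data (the matrix below is hypothetical, not our dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 3 strong latent directions embedded in 10 dimensions,
# plus a tiny amount of noise.
signal = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10))
X = signal + 0.01 * rng.normal(size=(200, 10))

# Keep the smallest number of PCs with cumulative explained variance >= 95%.
pca = PCA(n_components=0.95).fit(X)
```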
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error

# Keep the smallest number of components explaining 95% of the variance
PCA_data = PCA(n_components=0.95)
data_train_red = PCA_data.fit_transform(data_train)
print("Number of components:", PCA_data.n_components_)
print('Explained variation per principal component: {}'.format(PCA_data.explained_variance_ratio_))
Number of components: 20 Explained variation per principal component: [0.6344835 0.09107496 0.06270155 0.04033215 0.03156572 0.01538185 0.01138705 0.01009336 0.00898676 0.00758287 0.00614321 0.00492391 0.00467293 0.00430685 0.00377886 0.00331298 0.00300167 0.0027689 0.00266241 0.00228027]
print("Reconstruction error:", mean_squared_error(PCA_data.inverse_transform(data_train_red), data_train))
Reconstruction error: 24488.13749442456
For the HCC1806 experiment the results are similar: there are 34 PCs instead of 20, and the reconstruction error, although lower, is still quite high.
PCA_data = PCA(n_components=0.95)
data_train_red = PCA_data.fit_transform(data_train)
print("Number of components:", PCA_data.n_components_)
print('Explained variation per principal component: {}'.format(PCA_data.explained_variance_ratio_))
Number of components: 34 Explained variation per principal component: [0.29018923 0.18101256 0.12288734 0.07970126 0.04956884 0.03640102 0.02737402 0.02113149 0.01743474 0.01317286 0.01208516 0.0111941 0.00880007 0.00783661 0.00748447 0.00693555 0.00579659 0.00509238 0.00467221 0.00422117 0.00399608 0.00383594 0.00365575 0.00342266 0.00304432 0.00285331 0.00257095 0.00235782 0.0022066 0.00215775 0.00209672 0.00204413 0.00186566 0.00174111]
print("Reconstruction error:", mean_squared_error(PCA_data.inverse_transform(data_train_red), data_train))
Reconstruction error: 17554.361399621102
We would also like to see how much of the total variance the first components capture, and how the cumulative explained variance grows as we increase the number of components.
exp_var_pca = PCA_data.explained_variance_ratio_
v = len(PCA_data.explained_variance_ratio_)
cum_sum_eigenvalues = np.cumsum(exp_var_pca)
plt.bar(range(0,len(exp_var_pca)), exp_var_pca, alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(0,len(cum_sum_eigenvalues)), cum_sum_eigenvalues, where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
legend = plt.legend(loc='best', frameon=True)
legend.get_frame().set_edgecolor('black')
plt.tight_layout()
plt.title("MCF7 - PCA")
plt.grid(visible=False)
plt.xticks([i for i in range(v)], [i+1 for i in range(v)])
plt.show()
It is interesting to see that the first component explains more than 60% of the variance, while the second explains far less. This is not true for HCC1806, where the first component is responsible for only 29% of the variance but the drop at the second component is less dramatic.
exp_var_pca = PCA_data.explained_variance_ratio_
v = len(PCA_data.explained_variance_ratio_)
cum_sum_eigenvalues = np.cumsum(exp_var_pca)
plt.bar(range(0,len(exp_var_pca)), exp_var_pca, alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(0,len(cum_sum_eigenvalues)), cum_sum_eigenvalues, where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
legend = plt.legend(loc='best', frameon=True)
legend.get_frame().set_edgecolor('black')
plt.tight_layout()
plt.title("HCC1806 - PCA")
plt.grid(visible=False)
plt.xticks([i for i in range(v)], [i+1 for i in range(v)])
plt.show()
Let's visualize the first five components plotted against each other: the distribution of the cells is quite different between the two cell lines.
n_components = 5
pca = PCA(n_components=n_components)
components = pca.fit_transform(data_train)
total_var = pca.explained_variance_ratio_.sum() * 100
labels = {str(i): f"PC {i+1}" for i in range(n_components)}
labels['color'] = 'Condition'
fig = px.scatter_matrix(
components,
color=data_train_lab["Condition"],
dimensions=range(n_components),
labels=labels,
title=f'MCF7 - Total Explained Variance: {total_var:.2f}%',
)
fig.update_traces(diagonal_visible=False)
fig.show()
n_components = 5
pca = PCA(n_components=n_components)
components = pca.fit_transform(data_train)
total_var = pca.explained_variance_ratio_.sum() * 100
labels = {str(i): f"PC {i+1}" for i in range(n_components)}
labels['color'] = 'Condition'
fig = px.scatter_matrix(
components,
color=data_train_lab["Condition"],
dimensions=range(n_components),
labels=labels,
title=f'HCC1806 - Total Explained Variance: {total_var:.2f}%',
)
fig.update_traces(diagonal_visible=False)
fig.show()
For visualization purposes, we now set the number of components to 2 and then to 3. Starting from the reduced dataset, we plot each data point (cell) in green if it comes from the hypoxia condition and in red if it comes from the normoxia condition.
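The numeric encoding of the condition labels can also be done in one vectorized step with `Series.map`, which sidesteps pandas' chained-assignment warnings (the labels below are toy values standing in for the real condition column):

```python
import pandas as pd

# Hypothetical condition labels, standing in for data_train_lab["Condition"].
cond = pd.Series(["Norm", "Hypo", "Hypo", "Norm"])

# Vectorized 0/1 encoding; unmapped values would become NaN, so this also
# doubles as a sanity check on the label spelling.
encoded = cond.map({"Norm": 0, "Hypo": 1})
```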
PCA2_data = PCA(n_components=2)
principalComponents_hcc2 = PCA2_data.fit_transform(data_train)
data_pr2 = pd.DataFrame(data=principalComponents_hcc2, columns=['PC1', 'PC2'])
print('Explained variation per principal component: {}'.format(PCA2_data.explained_variance_ratio_))
Explained variation per principal component: [0.6344835 0.09107496]
data_pr2_lab = data_pr2.copy()
# Encode the condition as 0 (normoxia) / 1 (hypoxia) with a vectorized map,
# which avoids the chained-assignment SettingWithCopyWarning
data_pr2_lab["Condition"] = data_train_lab["Condition"].map({"Norm": 0, "Hypo": 1}).values
x = np.array(data_pr2_lab['PC1'])
y = np.array(data_pr2_lab['PC2'])
plt.scatter(x, y, c=data_pr2_lab["Condition"], cmap="prism")
plt.title("MCF7 - PCA of cells")
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
For HCC1806:
PCA2_data = PCA(n_components=2)
principalComponents_hcc2 = PCA2_data.fit_transform(data_train)
data_pr2 = pd.DataFrame(data=principalComponents_hcc2, columns=['PC1', 'PC2'])
print('Explained variation per principal component: {}'.format(PCA2_data.explained_variance_ratio_))
Explained variation per principal component: [0.29018923 0.18101256]
data_pr2_lab = data_pr2.copy()
# Encode the condition as 0 (normoxia) / 1 (hypoxia) with a vectorized map,
# which avoids the chained-assignment SettingWithCopyWarning
data_pr2_lab["Condition"] = data_train_lab["Condition"].map({"Normo": 0, "Hypo": 1}).values
x = np.array(data_pr2_lab['PC1'])
y = np.array(data_pr2_lab['PC2'])
plt.scatter(x, y, c=data_pr2_lab["Condition"], cmap="prism")
plt.title("HCC1806 - PCA of cells")
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
As already noticed, the cells are distributed quite differently.
In 2D, normoxic and hypoxic MCF7 cells appear clearly separated, while this is not the case for HCC1806.
PCA3_hcc = PCA(n_components=3)
principalComponents_hcc3 = PCA3_hcc.fit_transform(data_train)
data_pr3 = pd.DataFrame(data=principalComponents_hcc3, columns=['PC1', 'PC2', 'PC3'])
print('Explained variation per principal component: {}'.format(PCA3_hcc.explained_variance_ratio_))
Explained variation per principal component: [0.6344835 0.09107496 0.06270155]
data_pr3_lab = data_pr3.copy()
# Encode the condition as 0 (normoxia) / 1 (hypoxia) with a vectorized map,
# which avoids the chained-assignment SettingWithCopyWarning
data_pr3_lab["Condition"] = data_train_lab["Condition"].map({"Norm": 0, "Hypo": 1}).values
def PCA_3(EL, AZ):
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(projection='3d')
    x = np.array(data_pr3_lab['PC1'])
    y = np.array(data_pr3_lab['PC2'])
    z = np.array(data_pr3_lab['PC3'])
    scatter = ax.scatter(x, y, z, c=data_pr3_lab["Condition"], cmap="prism")
    labels = ["Normoxia", "Hypoxia"]
    legend_handles, legend_labels = scatter.legend_elements()
    legend = ax.legend(handles=legend_handles, labels=labels, loc='center left', bbox_to_anchor=(0, 0.8))
    ax.view_init(elev=EL, azim=AZ)
    print("Elevation:", EL, " Azimuth:", AZ)
    plt.show()

PCA_3(20, 120)
Elevation: 20  Azimuth: 120
PCA3_hcc = PCA(n_components=3)
principalComponents_hcc3 = PCA3_hcc.fit_transform(data_train)
data_pr3 = pd.DataFrame(data=principalComponents_hcc3, columns=['PC1', 'PC2', 'PC3'])
print('Explained variation per principal component: {}'.format(PCA3_hcc.explained_variance_ratio_))
Explained variation per principal component: [0.29018923 0.18101256 0.12288734]
data_pr3_lab = data_pr3.copy()
# Encode the condition as 0 (normoxia) / 1 (hypoxia) with a vectorized map,
# which avoids the chained-assignment SettingWithCopyWarning
data_pr3_lab["Condition"] = data_train_lab["Condition"].map({"Normo": 0, "Hypo": 1}).values
def PCA_3(EL, AZ):
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(projection='3d')
    x = np.array(data_pr3_lab['PC1'])
    y = np.array(data_pr3_lab['PC2'])
    z = np.array(data_pr3_lab['PC3'])
    scatter = ax.scatter(x, y, z, c=data_pr3_lab["Condition"], cmap="prism")
    labels = ["Normoxia", "Hypoxia"]
    legend_handles, legend_labels = scatter.legend_elements()
    legend = ax.legend(handles=legend_handles, labels=labels, loc='center left', bbox_to_anchor=(0, 0.8))
    ax.view_init(elev=EL, azim=AZ)
    print("Elevation:", EL, " Azimuth:", AZ)
    plt.show()

PCA_3(20, 120)
Elevation: 20  Azimuth: 120
In 3D the HCC1806 cells are better separated: this makes sense, as the variance explained by the third principal component (12%) is comparable to that of the second (18%), so the third component also carries relevant information.
PCA on genes is done mainly for visualization and, later, for clustering. We are not interested in reducing this dimension per se, as it is less relevant from a biological point of view.
PCA_hcc_g = PCA(n_components=3)
pc_hcc_genes = PCA_hcc_g.fit_transform(data_genes)
data_pr3_g = pd.DataFrame(data=pc_hcc_genes, columns=['PC1', 'PC2', 'PC3'])
print('Explained variation per principal component: {}'.format(PCA_hcc_g.explained_variance_ratio_))
Explained variation per principal component: [0.64301682 0.06518905 0.0184779 ]
x = np.array(data_pr3_g['PC1'])
y = np.array(data_pr3_g['PC2'])
plt.scatter(x, y, c="green", s=20)
plt.title("MCF7")
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
x = np.array(data_pr3_g['PC1'])
y = np.array(data_pr3_g['PC2'])
plt.scatter(x, y, c="green", s=20)
plt.title("HCC1806")
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
The plots are clearly similar for both cell lines.
Clustering is a crucial tool to gain insights on the datasets, especially when we have an enormous amount of features and it is difficult to understand how the data is structured. Ideally, we would like to obtain 2 clusters which we could identify with cells cultivated in Hypoxia and cells cultivated in Normoxia.
The types of clustering used are K-means and agglomerative (hierarchical) clustering.
We start by doing the clustering in full dimensions and then plotting the clusters found with PCA.
We try out some methods to determine the right number of clusters.
The elbow method is a heuristic that consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.
fig, ax = plt.subplots()
visualizer = KElbowVisualizer(KMeans(random_state=42), k=(2,7), ax=ax)
visualizer.fit(data_train)
ax.set_xticks(range(2,7))
visualizer.show()
plt.show()
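Under the hood, the curve drawn by `KElbowVisualizer` is essentially the k-means inertia (within-cluster sum of squares) for each k; a minimal sketch on toy blob data (not our dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with two well-separated groups, standing in for data_train.
X, _ = make_blobs(n_samples=60, centers=[[-5, -5], [5, 5]],
                  cluster_std=0.5, random_state=42)

# Inertia drops sharply until k reaches the true number of groups,
# then flattens: the "elbow".
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 6)]
```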
Silhouette analysis can be used to study the separation between the resulting clusters. The silhouette plot displays, on a scale from -1 to 1, a measure of how close each point in a cluster is to the points in the neighboring clusters.
Coefficients close to +1 indicate that the sample is far from the neighboring clusters; a value of 0 indicates that the sample is very close to the decision boundary between two neighboring clusters; negative values indicate that the sample may have been assigned to the wrong cluster. The silhouette score is the mean of these coefficients.
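As a quick illustration of the score itself (on synthetic blobs, not our data), two tight, well-separated groups yield a mean silhouette close to 1:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two tight, well-separated toy blobs.
X, _ = make_blobs(n_samples=100, centers=[[-5, -5], [5, 5]],
                  cluster_std=0.5, random_state=42)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette coefficient over all samples: close to 1 here.
score = silhouette_score(X, labels)
```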
from sklearn.metrics import silhouette_score
silhouette_scores = []
for k in range(2, 7):
    km = KMeans(n_clusters=k,
                max_iter=300,
                tol=1e-04,
                init='k-means++',
                n_init=10,
                random_state=42)
    km.fit(data_train)
    silhouette_scores.append(silhouette_score(data_train, km.labels_))
fig, ax = plt.subplots()
ax.plot(range(2, 7), silhouette_scores, color="black")
#ax.set_title('Silhouette Score Method')
ax.set_xlabel('Number of clusters')
ax.set_ylabel('Silhouette Scores')
plt.xticks(range(2, 7))
plt.tight_layout()
plt.show()
from sklearn.metrics import silhouette_samples
import matplotlib.ticker as ticker

def silhouette_plot(X, model, ax, colors):
    y_lower = 10
    y_tick_pos_ = []
    sh_samples = silhouette_samples(X, model.labels_)
    sh_score = silhouette_score(X, model.labels_)
    for idx in range(model.n_clusters):
        values = sh_samples[model.labels_ == idx]
        values.sort()
        size = values.shape[0]
        y_upper = y_lower + size
        ax.fill_betweenx(np.arange(y_lower, y_upper), 0, values,
                         facecolor=colors[idx], edgecolor=colors[idx])
        y_tick_pos_.append(y_lower + 0.5 * size)
        y_lower = y_upper + 10
    ax.axvline(x=sh_score, color="red", linestyle="--", label="Avg Silhouette Score")
    ax.set_title("Silhouette Plot for {} clusters".format(model.n_clusters))
    l_xlim = max(-1, min(-0.1, round(min(sh_samples) - 0.1, 1)))
    u_xlim = min(1, round(max(sh_samples) + 0.1, 1))
    ax.set_xlim([l_xlim, u_xlim])
    ax.set_ylim([0, X.shape[0] + (model.n_clusters + 1) * 10])
    ax.set_xlabel("silhouette coefficient values")
    ax.set_ylabel("cluster label")
    ax.set_yticks(y_tick_pos_)
    ax.set_yticklabels([str(idx) for idx in range(model.n_clusters)])
    ax.xaxis.set_major_locator(ticker.MultipleLocator(0.1))
    ax.legend(loc="best")
    return ax
k_max = 7
ncols = 3
nrows = k_max // ncols + (k_max % ncols > 0)
fig = plt.figure(figsize=(15,15), dpi=200)
for k in range(2, k_max + 1):
    km = KMeans(n_clusters=k,
                max_iter=300,
                tol=1e-04,
                init='k-means++',
                n_init=10,
                random_state=42)
    km_fit = km.fit(data_train)
    ax = plt.subplot(nrows, ncols, k - 1)
    silhouette_plot(data_train, km_fit, ax, cluster_colors)
fig.suptitle("Silhouette plots", fontsize=18, y=1)
plt.tight_layout()
plt.show()
Analyzing these plots, we conclude that the best choice is 2 clusters. The elbow method also suggests that clustering with k=3 makes sense, which hints at a further subdivision of the cells beyond the basic 'Hypoxia' and 'Normoxia' split.
We also see that for every choice of k there is one cluster, the bigger one, that is better defined.
Let's proceed with clustering:
kmeans = KMeans(n_clusters=2, random_state=2352).fit(data_train)
kmeans.labels_
array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,
0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 1, 1, 1], dtype=int32)
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=kmeans.labels_, cmap=ListedColormap(cluster_colors[:2]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title("2-means clustering")
plt.show()
KM_plot(20, 120, kmeans)
Elevation: 20 Azimut: 120
diagnoses(kmeans, data_train, cluster_colors)
kmeans2 = KMeans(n_clusters=3, random_state=2352).fit(data_train)
kmeans2.labels_
array([1, 1, 1, 1, 0, 2, 0, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 1, 1, 1,
1, 0, 2, 2, 0, 0, 1, 1, 1, 1, 1, 1, 2, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 2, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 2, 2, 0, 0, 2, 0, 1, 1, 1, 1, 1, 0, 0, 2, 0, 2, 0, 1, 1,
1, 1, 1, 1, 2, 0, 2, 0, 0, 2, 1, 1, 1, 1, 1, 1, 0, 2, 2, 0, 0, 2,
1, 1, 1, 1, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 0, 2, 2, 2, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 2, 0, 2, 1, 1, 1,
1, 1, 1, 2, 0, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
1, 0, 2, 2, 0, 2, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 0, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 0, 0, 0, 2, 0, 1,
1, 1, 1, 1, 1, 0, 0, 2, 2, 0, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 0, 0, 2], dtype=int32)
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=kmeans2.labels_, cmap=ListedColormap(cluster_colors[:3]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title("3-means clustering")
plt.show()
KM_plot(20, 120, kmeans2)
Elevation: 20 Azimut: 120
diagnoses(kmeans2, data_train, cluster_colors)
By clustering we identify two main clusters. Comparing these plots with the PCA visualization, we see that the two clusters effectively divide the cells into normoxic and hypoxic with high accuracy.
Let's quantify how good this division is by defining a clustering accuracy.
def clustering_accuracy(clust_labels, og_labels):
    # Cluster ids are arbitrary (0/1 may be swapped with respect to the true
    # labels), so take the better of the two possible assignments.
    matches = np.count_nonzero(clust_labels == og_labels) * 100 / len(og_labels)
    print("Clustering accuracy:", max(matches, 100 - matches), "%")
og_labels = data_pr2_lab["Condition"].values
clust_predict = kmeans.labels_
clustering_accuracy(og_labels, clust_predict)
Clustering accuracy: 97.2 %
Hence, K-means clustering is able to distinguish the cells with 97.2% accuracy, measured as the number of correct assignments (up to cluster relabeling) divided by the total number of samples.
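A more standard, label-permutation-invariant alternative to this hand-rolled accuracy is the adjusted Rand index (shown here on hypothetical toy labels, not our data):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Toy labels standing in for the true conditions and the cluster assignments;
# note the cluster ids are named in the opposite order.
truth = np.array([0, 0, 0, 1, 1, 1])
clusters = np.array([1, 1, 1, 0, 0, 1])  # one cell mis-clustered

# ARI is invariant to relabeling: 1.0 = perfect agreement, ~0.0 = random.
ari = adjusted_rand_score(truth, clusters)
```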
With 3-means clustering, we can still recognize the normoxic cluster detected with k=2, while the cluster corresponding to hypoxic cells is split in two. One interpretation is that there are two subclasses of hypoxic cells, possibly related to different levels of oxygen supply (the blue cluster could be the cells with less oxygen) or to other factors that should be discussed with a domain expert.
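One quick way to inspect how the three clusters line up with the two conditions is a contingency table (the labels below are toy values standing in for the real ones):

```python
import pandas as pd

# Hypothetical labels: cluster 1 captures normoxia, clusters 0 and 2 split hypoxia.
condition = pd.Series(["Norm", "Norm", "Hypo", "Hypo", "Hypo", "Hypo"], name="Condition")
cluster = pd.Series([1, 1, 0, 2, 0, 2], name="Cluster")

# Rows: conditions; columns: cluster ids; cells: co-occurrence counts.
table = pd.crosstab(condition, cluster)
```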
We perform agglomerative clustering using the standard Euclidean distance, which fits the task of measuring distances between cells well, together with Ward linkage. Other linkages (single, average and complete) were tried, but the results were much worse.
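For context (on toy blobs, not our data), the linkage is just an argument of `AgglomerativeClustering`; Ward, the default, merges the pair of clusters whose fusion least increases the total within-cluster variance:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Two well-separated toy groups.
X, _ = make_blobs(n_samples=60, centers=[[-5, -5], [5, 5]],
                  cluster_std=0.5, random_state=42)

# Ward linkage (the default) minimizes the increase in within-cluster variance;
# single/average/complete use min/mean/max pairwise distances instead.
ward_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
single_labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)
```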
agglomerative = AgglomerativeClustering().fit(data_train)
agglomerative.labels_
array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 0, 0, 0, 0])
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=agglomerative.labels_, cmap=ListedColormap(cluster_colors[:2]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
AG_plot(20,120)
agg_predict = agglomerative.labels_
clustering_accuracy(og_labels, agg_predict)
Clustering accuracy: 98.0 %
We can also plot the results as a dendrogram:
plot_dendrogram(agglomerative)
Agglomerative clustering seems to confirm the results of the k-means clustering: the accuracy is 98%.
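Our `plot_dendrogram` helper is defined earlier in the notebook; a possible implementation, following the recipe in the scikit-learn documentation, is sketched below. It assumes the model was fitted with `compute_distances=True` or with a `distance_threshold`, so that `model.distances_` is populated:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

def plot_dendrogram(model, **kwargs):
    # Build the linkage matrix scipy expects: [left, right, distance, size].
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)
    dendrogram(linkage_matrix, **kwargs)
```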
Now we can proceed by performing clustering on the space defined by the first 2 and 3 principal components. We start by performing agglomerative clustering.
agg_PC2 = AgglomerativeClustering().fit(data_pr2)
agg_PC2.labels_
array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 0, 0, 0])
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=agg_PC2.labels_, cmap=ListedColormap(cluster_colors[:2]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
aggPC2_predict = agg_PC2.labels_
clustering_accuracy(og_labels, aggPC2_predict)
Clustering accuracy: 99.2 %
agg_PC3 = AgglomerativeClustering().fit(data_pr3)
agg_PC3.labels_
array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 0, 0, 0, 0])
aggPC3_predict = agg_PC3.labels_
clustering_accuracy(og_labels, aggPC3_predict)
Clustering accuracy: 98.4 %
AGPC_plot_int(20,120)
We can see that the clustering accuracy is even higher, in both the two- and three-dimensional principal-component spaces. Since the accuracy is already high, we do not perform k-means on the principal-component space.
We now move to clustering HCC1806, where the results are quite different. In particular, we will see that the resulting clusters do not resemble the division into hypoxic and normoxic groups previously visualized with PCA. The methods, techniques, and analyses are similar to those used for MCF7.
fig, ax = plt.subplots()
visualizer = KElbowVisualizer(KMeans(random_state=42), k=(2,7), ax=ax)
visualizer.fit(data_train)
ax.set_xticks(range(2,7))
visualizer.show()
plt.show()
YellowbrickWarning: No 'knee' or 'elbow point' detected. This could be due to bad clustering, no actual clusters being formed, etc. (pass `locate_elbow=False` to remove the warning)
No knee or elbow point is detected in this case: this already suggests that the HCC1806 cells may not be clearly divided into clusters.
from sklearn.metrics import silhouette_score
silhouette_scores = []
for k in range(2, 7):
    km = KMeans(n_clusters=k,
                max_iter=300,
                tol=1e-04,
                init='k-means++',
                n_init=10,
                random_state=42,
                algorithm='auto')
    km.fit(data_train)
    silhouette_scores.append(silhouette_score(data_train, km.labels_))
fig, ax = plt.subplots()
ax.plot(range(2, 7), silhouette_scores, color="black")
#ax.set_title('Silhouette Score Method')
ax.set_xlabel('Number of clusters')
ax.set_ylabel('Silhouette Scores')
plt.xticks(range(2, 7))
plt.tight_layout()
plt.show()
def silhouette_plot(X, model, ax, colors):
    y_lower = 10
    y_tick_pos_ = []
    sh_samples = silhouette_samples(X, model.labels_)
    sh_score = silhouette_score(X, model.labels_)
    for idx in range(model.n_clusters):
        values = sh_samples[model.labels_ == idx]
        values.sort()
        size = values.shape[0]
        y_upper = y_lower + size
        ax.fill_betweenx(np.arange(y_lower, y_upper), 0, values,
                         facecolor=colors[idx], edgecolor=colors[idx])
        y_tick_pos_.append(y_lower + 0.5 * size)
        y_lower = y_upper + 10
    ax.axvline(x=sh_score, color="red", linestyle="--", label="Avg Silhouette Score")
    ax.set_title("Silhouette Plot for {} clusters".format(model.n_clusters))
    l_xlim = max(-1, min(-0.1, round(min(sh_samples) - 0.1, 1)))
    u_xlim = min(1, round(max(sh_samples) + 0.1, 1))
    ax.set_xlim([l_xlim, u_xlim])
    ax.set_ylim([0, X.shape[0] + (model.n_clusters + 1) * 10])
    ax.set_xlabel("silhouette coefficient values")
    ax.set_ylabel("cluster label")
    ax.set_yticks(y_tick_pos_)
    ax.set_yticklabels(str(idx) for idx in range(model.n_clusters))
    ax.xaxis.set_major_locator(ticker.MultipleLocator(0.1))
    ax.legend(loc="best")
    return ax
k_max = 7
ncols = 3
nrows = k_max // ncols + (k_max % ncols > 0)
fig = plt.figure(figsize=(15, 15), dpi=200)
for k in range(2, k_max + 1):
    km = KMeans(n_clusters=k,
                max_iter=300,
                tol=1e-04,
                init='k-means++',
                n_init=10,
                random_state=42,
                algorithm='auto')
    km_fit = km.fit(data_train)
    ax = plt.subplot(nrows, ncols, k - 1)
    silhouette_plot(data_train, km_fit, ax, cluster_colors)
fig.suptitle("Silhouette plots", fontsize=18, y=1)
plt.tight_layout()
plt.show()
Contrary to MCF7, here we do not have any big cluster for any choice of k.
Let's perform the clustering.
kmeans = KMeans(n_clusters=2, random_state=2352).fit(data_train)
kmeans.labels_
array([1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1,
0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0,
0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
0, 0, 0, 0, 0, 0], dtype=int32)
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=kmeans.labels_, cmap=ListedColormap(cluster_colors[:2]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title("2-means clustering")
plt.show()
KM_plot(20, 100, kmeans)
Elevation: 20 Azimuth: 100
kmeans2 = KMeans(n_clusters=3, random_state=2352).fit(data_train)
kmeans2.labels_
array([0, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1,
2, 1, 1, 0, 2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 0, 1, 1, 1, 1, 2, 2,
2, 1, 1, 1, 1, 2, 2, 2, 2, 0, 1, 0, 1, 1, 2, 2, 2, 1, 1, 0, 2, 2,
2, 2, 0, 1, 1, 0, 2, 0, 2, 1, 1, 0, 1, 2, 2, 0, 0, 0, 1, 1, 2, 0,
2, 2, 0, 2, 0, 1, 0, 1, 2, 2, 1, 1, 2, 0, 2, 1, 1, 1, 1, 1, 2, 2,
0, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1,
0, 1, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 1, 0, 1, 0, 0, 0, 2, 0, 1, 1,
1, 2, 0, 2, 2, 1, 1, 1, 0, 2, 2, 1, 0, 1, 1, 2, 2, 2, 0, 1, 1, 1,
2, 2, 2, 1, 2, 2], dtype=int32)
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=kmeans2.labels_, cmap=ListedColormap(cluster_colors[:3]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title("3-means clustering")
plt.show()
KM_plot(20, 100, kmeans2)
Elevation: 20 Azimuth: 100
The clustering does not give us a great result: many cells are clearly misclassified. Since the centroids are randomly initialized, we tried out different seeds, but the result remains similar. Comparing the 3-means clustering with the true classification of the cells, two of the three clusters appear more accurate. Now let's compute the accuracy of the clustering:
og_labels = data_pr2_lab["Condition"].values
clust_predict = kmeans.labels_
clustering_accuracy(og_labels, clust_predict)
Clustering accuracy: 51.64835164835165 %
Hence, K-means clustering distinguishes the cells with only 51.65% accuracy, measured as the number of correct classifications divided by the total number of samples, barely better than chance.
agglomerative = AgglomerativeClustering().fit(data_train)
agglomerative.labels_
array([1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1,
0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0,
0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0])
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=agglomerative.labels_, cmap=ListedColormap(cluster_colors[:2]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
AG_plot_k(20,100,2)
agg_predict = agglomerative.labels_
clustering_accuracy(og_labels, agg_predict)
Clustering accuracy: 61.53846153846154 %
The accuracy of agglomerative clustering is thus higher, so it gives better results here.
In the principal-component spaces we perform only agglomerative clustering, as it seems more promising than k-means.
agg_PC2 = AgglomerativeClustering().fit(data_pr2)
agg_PC2.labels_
array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1,
1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,
1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1,
0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1,
0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1,
1, 1, 1, 0, 1, 1])
aggPC2_predict = agg_PC2.labels_
clustering_accuracy(og_labels, aggPC2_predict)
Clustering accuracy: 51.0989010989011 %
x = np.array(data_pr3['PC1'])
y = np.array(data_pr3['PC2'])
plt.scatter(x, y, c=agg_PC2.labels_, cmap=ListedColormap(cluster_colors[:2]))
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
agg_PC3 = AgglomerativeClustering().fit(data_pr3)
agg_PC3.labels_
array([1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1,
0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0,
0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0,
0, 0, 0, 0, 0, 0])
aggPC3_predict = agg_PC3.labels_
clustering_accuracy(og_labels, aggPC3_predict)
Clustering accuracy: 57.142857142857146 %
AGPC_plot_int(20,100)
In the space of the principal components, agglomerative clustering performs worse.
Overall, clustering on HCC1806 does not succeed in dividing the cells into normoxia and hypoxia clusters. We should try different approaches to get a better division, so we try clustering after UMAP dimensionality reduction.
UMAP is a nonlinear dimensionality reduction technique that aims to preserve both the local and global structure of the data. It constructs a high-dimensional graph in which each data point is connected to its nearest neighbours, and then optimizes a lower-dimensional embedding of the points so that the distances between connected points in the graph are preserved as closely as possible.
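The graph-construction step can be illustrated with scikit-learn's `kneighbors_graph` (a toy sketch on random data; the `embedding` used below was presumably produced with the `umap-learn` package):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # stand-in for an expression matrix

# UMAP's first step: connect every point to its k nearest neighbours.
# The low-dimensional embedding is then optimised so that distances
# along this graph are preserved as closely as possible.
graph = kneighbors_graph(X, n_neighbors=15, mode="distance")
print(graph.shape, graph.nnz)  # (100, 100) sparse matrix, 15 edges per row
```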
After dimensionality reduction with UMAP, we perform k-means clustering on the space of the reduced components.
UMA = KMeans(n_clusters=2)
labels_UM = UMA.fit_predict(embedding)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels_UM, cmap=my_cmap)
plt.show()
clustering_accuracy(og_labels, labels_UM)
Clustering accuracy: 86.81318681318682 %
The accuracy of this technique is significantly higher.
We now perform clustering on genes, to get more insight. We do it both in full dimension, projecting the results with PCA, and in the space of the principal components, using the same methods and techniques as before.
We start by determining the right number of clusters (with the same methods used before).
fig, ax = plt.subplots()
visualizer = KElbowVisualizer(KMeans(), k=(2,7),ax=ax)
visualizer.fit(data_genes)
ax.set_xticks(range(2,7))
visualizer.show()
plt.show()
silhouette_scores = []
for k in range(2, 7):
    km = KMeans(n_clusters=k,
                max_iter=300,
                tol=1e-04,
                init='k-means++',
                n_init=10,
                random_state=42,
                algorithm='auto')
    km.fit(data_genes)
    silhouette_scores.append(silhouette_score(data_genes, km.labels_))
fig, ax = plt.subplots()
ax.plot(range(2, 7), silhouette_scores, 'x-', color="black")
ax.set_title('Silhouette Score Method')
ax.set_xlabel('Number of clusters')
ax.set_ylabel('Silhouette Scores')
plt.xticks(range(2, 7))
plt.tight_layout()
plt.show()
k_max = 7
ncols = 3
nrows = k_max // ncols + (k_max % ncols > 0)
fig = plt.figure(figsize=(15, 15), dpi=200)
for k in range(2, k_max + 1):
    km = KMeans(n_clusters=k,
                max_iter=300,
                tol=1e-04,
                init='k-means++',
                n_init=10,
                random_state=42,
                algorithm='auto')
    km_fit = km.fit(data_genes)
    ax = plt.subplot(nrows, ncols, k - 1)
    silhouette_plot(data_genes, km_fit, ax, genes_colors)
fig.suptitle("Silhouette plots", fontsize=18, y=1)
plt.tight_layout()
plt.show()
These analyses suggest that the best number of clusters is either two (silhouette) or three (elbow). Moreover, the silhouette plots clearly show that for any choice of k there is one big, dominant cluster.
kmeans_g2 = KMeans(n_clusters=2, random_state=1324).fit(data_genes)
kmeans_g2.labels_
array([1, 1, 1, ..., 1, 0, 0], dtype=int32)
KM_plot_k_int(30, 120, 2)
diagnoses(kmeans_g2, data_genes, genes_colors)
kmeans_g3 = KMeans(n_clusters=3, random_state=1324).fit(data_genes)
kmeans_g3.labels_
array([2, 2, 1, ..., 1, 0, 0], dtype=int32)
KM_plot_k_int(30, 120, 3)
diagnoses(kmeans_g3, data_genes, genes_colors)
We also try with k=4, to see if we can spot other "classes" of genes.
kmeans_g4 = KMeans(n_clusters=4, random_state=1324).fit(data_genes)
kmeans_g4.labels_
array([3, 3, 1, ..., 1, 0, 0], dtype=int32)
KM_plot_k_int(30, 120, 4)
diagnoses(kmeans_g4, data_genes, genes_colors)
agglomerative_g2 = AgglomerativeClustering().fit(data_pr3_g)
agglomerative_g2.labels_
array([0, 0, 0, ..., 0, 1, 1])
We also perform agglomerative clustering, again using Euclidean distance and Ward linkage, and visualize the result with a plot.
AG_g_plot(30,120)
The execution of all these tasks on HCC1806 gave very similar results. Indeed, as noticed before, the "distribution" of genes seems very similar between the two cell lines. The only point worth highlighting is the 4-means clustering, where we observe a slightly different separation that might be meaningful from a biological point of view.
KM_plot_k_int(30,120,4)
We now delve into the heart of supervised machine learning methods to understand the dynamics of our gene expression data across the different sequencing techniques and cell types. The goal is to create models that are not only accurate but also offer insights into the nature of the data and the underlying biological processes.
We have four datasets at our disposal: MCF7 and HCC1806, each sequenced with both SmartSeq and DropSeq. For each dataset we employed a range of supervised learning algorithms: Support Vector Machines (SVM), Random Forests, and Logistic Regression. The exception is HCC1806 - DropSeq, for which we also tried an MLP classifier.
The choice of these algorithms was influenced by their diverse strengths. SVMs are particularly adept at handling high-dimensional data, a common characteristic of gene expression datasets. Random Forests, on the other hand, are known for their robustness to overfitting and their ability to handle nonlinear relationships. Logistic Regression, while seemingly simpler, is a highly interpretable model that can provide insights into which genes are most informative in distinguishing between cell types.
To optimize the performance of each of these models, we undertook hyperparameter tuning. This process was carried out with a focus on achieving a fine balance between computational complexity and model performance. The underlying premise was to ensure that our models are not only accurate but also efficient - a crucial aspect when dealing with large-scale gene expression data.
In order to build a more powerful classifier, we exploited the power of ensemble learning: by leveraging the strengths of multiple learning algorithms, we aimed to construct an ensemble model that offers improved predictive performance and robustness. This integrative approach often helps to achieve better performance by capturing more complex underlying structures in the data.
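As a minimal sketch of this idea, scikit-learn's `VotingClassifier` can combine the three models, weighting each by its cross-validated accuracy (synthetic data here; in our notebook the weights would be the CV accuracies such as `acc_log` and `acc_svm`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

log = LogisticRegression(max_iter=1000)
svm = SVC(probability=True)  # soft voting needs predict_proba
rf = RandomForestClassifier(random_state=42)

# Weight each model by its cross-validated accuracy, so that stronger
# base learners contribute more to the soft vote.
weights = [cross_val_score(m, X_tr, y_tr, cv=5).mean() for m in (log, svm, rf)]

ensemble = VotingClassifier(
    estimators=[("log", log), ("svm", svm), ("rf", rf)],
    voting="soft",
    weights=weights,
)
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
print(acc)
```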
Given the extent of our analysis, we have decided to maintain here a focus on a single dataset, i.e. MCF7 SmartSeq - this will allow us to delve deeper into the analytical process without compromising readability. However, please note that all analyses were conducted similarly across all datasets, and we will bring in results from other datasets where they offer interesting contrasts or confirmations.
In all classifiers we follow the same steps:
Tune the hyperparameters and select the best model;
Make some plots: decision boundary and accuracy vs number of features;
Performance on test set.
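The decision-boundary plots in step 2 can be produced with a meshgrid over two features, along these lines (illustrative synthetic data; our actual plots use selected genes or principal components):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, safe outside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Two synthetic features stand in for e.g. two top genes or PCs.
X, y = make_classification(n_samples=150, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
clf = SVC(kernel="rbf").fit(X, y)

# Predict on a grid covering the feature plane and colour each grid
# cell by its predicted class to reveal the boundary.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.close()
```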
Here we list the libraries and methods used from scikit-learn. Then we import the dataset, add the target labels, and split into training and test sets. The test sets will be used in the evaluation section to assess the performance of the model on unseen data. Regarding the MCF7 dataset, we observe a relatively balanced distribution across the two labels, so we believe accuracy is an appropriate evaluation metric here: it is straightforward to interpret and widely used in the field, which keeps our performance assessment clear while ensuring the results remain meaningful.
#Importing the dataset and adding label
df = pd.read_csv("drive/MyDrive/Datasets/MCF7_SmartS_Filtered_Normalised_3000_Data_train.txt", sep=" ")
df = df.T
df['label'] = df.index.to_series().apply(lambda x: 'Normoxia' if 'Norm' in x else 'Hypoxia')
df["label"].value_counts() #pretty balanced! accuracy is fine
Normoxia    126
Hypoxia     124
Name: label, dtype: int64
#Creating X and y
X = df.drop("label", axis = 1)
y = df["label"]
#Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(187, 3000) (187,) (63, 3000) (63,)
We use the standard LogisticRegression object from scikit-learn and tune the coefficient C (inverse of regularization strength) over a set of values. Note that for cross-validation we used the `neg_log_loss` scoring, as logistic regression produces probabilistic outputs.
log = LogisticRegression(solver='liblinear')
params_log = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
log_gs = GridSearchCV(log, params_log, cv=5, scoring=['neg_log_loss'], refit='neg_log_loss')
log_gs.fit(X_train, y_train)
log_gs.best_estimator_
LogisticRegression(C=1, penalty='l1', solver='liblinear')
best_log = log_gs.best_estimator_
log_gs.best_params_
{'C': 10, 'penalty': 'l1'}
This will be our chosen model for Logistic Regression: l1 penalty with C equal to 10. As simple as it may seem, its performance is outstanding:
acc_log = cross_val_score(best_log, X_train, y_train).mean() #will be used as weight in Ensemble Classifier
ypredlog = best_log.predict(X_test)
accuracy_logistic = accuracy_score(y_test, ypredlog)
accuracy_logistic
1.0
We can also explore how the accuracy of the model behaves as we change the number of features we train it with. As a reference, look at this graph:
features_range = range(1, 101, 5)
scores = []
for n in features_range:
    # Select top n features
    selector = SelectKBest(mutual_info_classif, k=n)
    X_new = selector.fit_transform(X_train, y_train)
    # Train the model
    model = LogisticRegression(C=10, penalty='l1', solver='liblinear')
    score = cross_val_score(model, X_new, y_train, cv=5, scoring='accuracy').mean()
    scores.append(score)
plt.figure(figsize=(10, 6))
plt.plot(features_range, scores, marker='o')
plt.xlabel('Number of features')
plt.ylabel('Accuracy')
plt.title('Number of features vs Accuracy')
plt.grid(True)
plt.show()
In this analysis we incrementally increased the number of features (genes) used by the model, in steps of five. These "best genes" were selected by their calculated mutual information, following the approach employed in earlier sections of this project.
The model's performance improves rapidly, achieving perfect accuracy once about 70 genes are included. This exceptional performance may be due to the quality of our dataset: the genes were largely curated, chosen specifically for their power to distinguish hypoxia from normoxia.
Our next model is SVM, and to implement it we are going to use the SVC() class from Scikit.
The procedure for this section follows essentially the other ones, with a noteworthy addition of a discussion on precision-recall tradeoff. We believe this quick remark on error analysis enables a more comprehensive understanding of our model performance.
As SVM's complexity increases dramatically with dimensionality, we employed different strategies based on dataset size. For larger DropSeq datasets, we opted for Randomized Search over Grid Search for efficiency. Due to significant running times, we also had to simplify the process by reducing the number of hyperparameters to be tuned.
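For reference, a randomized search over the same kind of SVM hyperparameters could look like this (a sketch with illustrative ranges, not the exact search we ran on the DropSeq data):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Sample C and gamma from log-uniform distributions instead of
# enumerating a full grid; n_iter bounds the total number of fits.
param_dist = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)}
search = RandomizedSearchCV(SVC(kernel="rbf"), param_dist,
                            n_iter=10, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

With n_iter=10 this costs 30 fits, versus the 960 of our full grid search above.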
# Define the parameter grid
param_grid = {'kernel': ['rbf', 'sigmoid', 'poly', 'linear'], 'C': [0.1, 1, 10, 100], 'gamma': [1, 10, 100], 'degree': [2, 3, 4, 5]}
# Create the SVM model
svm_model = SVC()
# Perform grid search with cross-validation
grid_search = GridSearchCV(svm_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
FitFailedWarning: 100 fits failed out of a total of 960 with `ValueError: The dual coefficients or intercepts are not finite. The input data may contain large values and need to be preprocessed.` The scores on these train-test partitions are set to nan.
UserWarning: One or more of the test scores are non-finite: the cross-validation scores alternate between roughly 0.508 and 0.995, with nan for the failed fits.
GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': [0.1, 1, 10, 100], 'degree': [2, 3, 4, 5],
                         'gamma': [1, 10, 100],
                         'kernel': ['rbf', 'sigmoid', 'poly', 'linear']},
             scoring='accuracy')
# Best model
best_svm = grid_search.best_estimator_
# Get the best parameter values
best_parameters = grid_search.best_params_
best_parameters
{'C': 0.1, 'degree': 2, 'gamma': 1, 'kernel': 'poly'}
# Accuracy on training data
cross_val_score(best_svm, X_train, y_train, cv=5, scoring="accuracy")
array([0.97368421, 1. , 1. , 1. , 1. ])
acc_svm = cross_val_score(best_svm, X_train, y_train).mean() #will be used later
# Confusion matrix
predictions = cross_val_predict(best_svm, X_train, y_train, cv=3)
conf_matrix = confusion_matrix(y_train, predictions)
conf_matrix
array([[92, 0],
[ 1, 94]])
# With percentages
row_sums = conf_matrix.sum(axis=1, keepdims=True)
norm_conf_matrix = np.round(conf_matrix / row_sums, 2)
norm_conf_matrix
array([[1. , 0. ],
[0.01, 0.99]])
#Precision and recall
print("Precision score =",conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[0, 1]))
print("Recall score =",conf_matrix[1, 1] / (conf_matrix[1, 1] + conf_matrix[1, 0]))
Precision score = 1.0 Recall score = 0.9894736842105263
Comment on precision and recall
In the context of our analysis, cells subjected to hypoxia emerge as potential indicators of malignant tumours. The paramount objective would therefore be to accurately flag these cells, given their role in the onset of cancer.
Consequently, one strategy could involve orienting our classifier to prioritize the identification of hypoxia cells, even at the risk of occasional misclassifications (such as falsely labelling normoxia cells as hypoxia, the so-called false positives). This calls for a classifier with high recall, even at the cost of modest precision.
However, this approach deviates from our initial assignment: our principal task is to distinguish between the two cellular conditions, hypoxia and normoxia, without overemphasizing either. Our goal remains unbiased discernment rather than prioritized detection, though this trade-off is something a clinician may want to consider.
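If one did want to prioritize recall on the hypoxia class, the SVC decision threshold can simply be lowered. A minimal sketch on synthetic data (with class 1 standing in for "Hypoxia"):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC().fit(X_tr, y_tr)
scores = svm.decision_function(X_te)

# Lowering the threshold flags more samples as positive: recall can
# only go up, typically at the cost of precision.
results = {}
for thr in (0.0, -0.5):
    y_pred = (scores > thr).astype(int)
    results[thr] = (precision_score(y_te, y_pred), recall_score(y_te, y_pred))
    print(thr, results[thr])
```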
Decision Boundary
#Splitting dataset
df_train, df_test = train_test_split(df, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(187, 3000) (187,) (63, 3000) (63,)
df_train = df_train.transpose()
def find_word(string, word1, word2):
    # Return word1 if it occurs in the (letters-only) string, else word2.
    string = ''.join(filter(str.isalpha, string))
    word1 = ''.join(filter(str.isalpha, word1))
    for i in range(len(string) - len(word1) + 1):
        if string[i:i+len(word1)] == word1:
            return word1
    return word2
def remove_double_quotes(word):
return word.replace('"', '')
df_train = df_train.rename(columns={"{}".format(i):"{}".format(remove_double_quotes(i)) for i in df_train.columns})
df_train = df_train.rename(columns={"{}".format(i):"{}".format(find_word(i, "Norm", "Hypo")) for i in df_train.columns})
df_train = df_train.transpose()
df_train = df_train.drop(columns=['label'])
df_train
| "CYP1B1" | "CYP1B1-AS1" | "CYP1A1" | "NDRG1" | "DDIT4" | "PFKFB3" | "HK2" | "AREG" | "MYBL2" | "ADM" | ... | "CD27-AS1" | "DNAI7" | "MAFG" | "LZTR1" | "BCO2" | "GRIK5" | "SLC25A27" | "DENND5A" | "CDK5R1" | "FAM13A-AS1" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hypo | 14546 | 5799 | 6817 | 338 | 3631 | 460 | 1259 | 0 | 76 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Hypo | 6734 | 2631 | 226 | 1203 | 6612 | 3025 | 961 | 142 | 32 | 838 | ... | 20 | 0 | 54 | 33 | 0 | 0 | 0 | 109 | 0 | 0 |
| Hypo | 4099 | 1583 | 0 | 401 | 1877 | 1691 | 274 | 1220 | 300 | 234 | ... | 0 | 0 | 26 | 151 | 0 | 0 | 0 | 58 | 0 | 0 |
| Norm | 196 | 102 | 1 | 243 | 266 | 278 | 78 | 1 | 199 | 0 | ... | 79 | 0 | 1 | 0 | 0 | 0 | 0 | 45 | 19 | 0 |
| Hypo | 4596 | 1689 | 5136 | 1496 | 4329 | 3666 | 3566 | 77 | 173 | 124 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 39 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Hypo | 29803 | 12073 | 8024 | 1414 | 7148 | 4941 | 2937 | 468 | 293 | 486 | ... | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 129 | 0 | 0 |
| Hypo | 1338 | 554 | 14 | 634 | 3513 | 1360 | 303 | 558 | 178 | 994 | ... | 0 | 0 | 46 | 5 | 0 | 0 | 0 | 14 | 0 | 0 |
| Hypo | 12647 | 5175 | 61 | 608 | 4343 | 1175 | 1410 | 39 | 1 | 1946 | ... | 24 | 0 | 17 | 0 | 0 | 0 | 0 | 101 | 0 | 22 |
| Hypo | 5954 | 2311 | 0 | 3884 | 12034 | 5986 | 5103 | 0 | 0 | 1242 | ... | 0 | 0 | 235 | 0 | 0 | 0 | 0 | 10 | 0 | 21 |
| Norm | 0 | 0 | 0 | 0 | 196 | 3 | 0 | 1 | 461 | 0 | ... | 0 | 0 | 62 | 0 | 0 | 0 | 0 | 21 | 0 | 0 |
187 rows × 3000 columns
data_train_transpose = df.transpose(copy=True)
data_train_transpose_lab = data_train_transpose.copy() #with labels
data_train_transpose_lab['label'] = data_train_transpose.index.to_series().apply(lambda x: 'Norm' if 'Norm' in x else 'Hypo')
# features = data_train_transpose.columns
PCA2_data = PCA(n_components=2)
principalComponents_hcc2 = PCA2_data.fit_transform(df_train)
data_pr2 = pd.DataFrame(data=principalComponents_hcc2, columns=['PC1', 'PC2'])
print('Explained variation per principal component: {}'.format(PCA2_data.explained_variance_ratio_))
Explained variation per principal component: [0.6446813 0.08999785]
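The first two components capture roughly 73% of the variance here. A generic way to sanity-check how many components a dataset needs is to look at the cumulative explained variance ratio. The sketch below uses a synthetic matrix as a stand-in for the expression data (the shapes and the 90% threshold are illustrative choices, not values from our analysis):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic stand-in for the (cells x genes) expression matrix, with correlated columns
X = rng.normal(size=(187, 300)) @ rng.normal(size=(300, 300))

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components capturing 90% of the total variance
n_90 = int(np.searchsorted(cumvar, 0.90) + 1)
print(n_90, cumvar[:2])
```

When one variance ratio dominates the rest, as in our output above, a 2-D PCA projection is a reasonable basis for visualizing decision boundaries.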
data_pr2_lab = data_pr2.copy()
# Map each cell's condition to a numeric label (Norm -> 0, Hypo -> 1)
data_pr2_lab["Condition"] = [0 if label == "Norm" else 1 for label in df_train.index]
data_pr2_lab_copy = data_pr2_lab.copy()
data_pr2_lab_copy.drop('Condition', axis=1, inplace=True)
kernels = ['linear', 'rbf', 'sigmoid', 'poly']
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
for idx, kernel in enumerate(kernels):
    svm_model = SVC(kernel=kernel, C=0.1)
    # Fit on the two principal components as a plain array (avoids feature-name warnings at predict time)
    svm_model.fit(data_pr2_lab_copy.values, data_pr2_lab["Condition"])
    ax = axes[idx // 2][idx % 2]
    # Scatter plot of the data points (red points are cells in Normoxia, green ones are in Hypoxia)
    ax.scatter(data_pr2_lab["PC1"], data_pr2_lab["PC2"], c=data_pr2_lab["Condition"], cmap="prism")
    ax.set_title(kernel)
    ax.set_xlabel('PC1')
    ax.set_ylabel('PC2')
    # Create a mesh grid of points covering the PC1-PC2 plane
    x_min, x_max = data_pr2_lab.iloc[:, 0].min() - 1, data_pr2_lab.iloc[:, 0].max() + 1
    y_min, y_max = data_pr2_lab.iloc[:, 1].min() - 1, data_pr2_lab.iloc[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 100), np.arange(y_min, y_max, 100))
    # Obtain predicted class labels for each point in the mesh grid
    Z = svm_model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the decision boundary and the margin
    ax.contour(xx, yy, Z, colors='b', linewidths=0.5)
plt.suptitle('Decision Boundaries for different Kernels', fontsize=16)
plt.tight_layout()
plt.show()
To analyze the distinct properties of the various kernels, we illustrate the decision boundaries rendered by each model. Since a visual representation requires two dimensions, we had to reduce our 3000-dimensional dataset to its two most informative axes. We accomplished this through Principal Component Analysis (PCA), selecting the two principal components that account for the largest proportion of the dataset's variance. The plots therefore visualize the decision boundaries while retaining the most salient structure of our high-dimensional dataset.
Testing the number of features
features_range = range(1, 101, 5)
scores = []
for n in features_range:
    # Select top n features by mutual information
    selector = SelectKBest(mutual_info_classif, k=n)
    X_new = selector.fit_transform(X_train, y_train)
    # Evaluate the tuned SVM on the reduced feature set
    model = best_svm
    score = cross_val_score(model, X_new, y_train, cv=5, scoring='accuracy').mean()
    scores.append(score)
plt.figure(figsize=(10, 6))
plt.plot(features_range, scores, marker='o')
plt.xlabel('Number of features')
plt.ylabel('Accuracy')
plt.title('Number of features vs Accuracy')
plt.grid(True)
plt.show()
Accuracy on test set
# Evaluate on the held-out test set
best_svm.fit(X_train, y_train)
test_accuracy = best_svm.score(X_test, y_test)
print("Test Accuracy:", test_accuracy)
We now move to Random Forest, itself an ensemble learning model: it bases its predictions on the collection of decision trees it builds.
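Concretely, for classification the forest's predicted probability is the average of its individual trees' probabilities. The minimal sketch below verifies this on synthetic data (the dataset and hyperparameters are illustrative, not ours):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
rf = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)

# The forest's predicted probability is the mean of its trees' probabilities
tree_probs = np.mean([t.predict_proba(X) for t in rf.estimators_], axis=0)
print(np.allclose(tree_probs, rf.predict_proba(X)))  # → True
```

This averaging over many decorrelated trees is what gives the forest its variance reduction over any single decision tree.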
rf = RandomForestClassifier(random_state=42)
params_rf = {"n_estimators": [25, 50, 100, 200, 300], "max_leaf_nodes" : np.arange(20, 100, 10)}
rf_gs = GridSearchCV(rf, params_rf, cv=5)
rf_gs.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42),
             param_grid={'max_leaf_nodes': array([20, 30, 40, 50, 60, 70, 80, 90]),
                         'n_estimators': [25, 50, 100, 200, 300]})
rf_gs.best_estimator_
RandomForestClassifier(max_leaf_nodes=20, n_estimators=25, random_state=42)
rf_gs.best_params_
{'max_leaf_nodes': 20, 'n_estimators': 25}
best_rf = rf_gs.best_estimator_
acc_rf = cross_val_score(best_rf, X_train, y_train).mean() # will be used later
acc_rf
1.0
Again, we analyze model performance as a function of the number of features, and the scores are quite impressive:
scores = []
for n in features_range:
    # Select top n features by mutual information
    selector = SelectKBest(mutual_info_classif, k=n)
    X_new = selector.fit_transform(X_train, y_train)
    # Evaluate the tuned Random Forest on the reduced feature set
    model = RandomForestClassifier(max_leaf_nodes=20, n_estimators=25, random_state=42)
    score = cross_val_score(model, X_new, y_train, cv=5, scoring='accuracy').mean()
    scores.append(score)
plt.figure(figsize=(10, 6))
plt.plot(features_range, scores, marker='o')
plt.xlabel('Number of features')
plt.ylabel('Accuracy')
plt.title('Number of features vs Accuracy')
plt.grid(True)
plt.show()
Trying to investigate which features (genes in our case) are of most importance, we employ the feature_importances_ attribute of the trained RandomForest model. This attribute computes the mean decrease in impurity, which is observed when splitting the data based on a particular feature, averaged over all trees in the forest.
These genes stand out because their expression levels (either above or below certain thresholds) provide pivotal information for the model to distinguish cells exposed to hypoxic versus normoxic conditions. Their higher importance scores indicate that these genes' expression levels are strongly associated with the cell's oxygen condition, making them key players in our classification task; note that importance reflects predictive value for the model, not necessarily a causal role.
feature_importances = rf_gs.best_estimator_.feature_importances_
features = X.columns
# create DataFrame to hold the feature names and their corresponding importance scores
feature_importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': feature_importances
})
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
print(feature_importance_df)
           Feature  Importance
103       "MT-CYB"    0.071049
477      "FAM162A"    0.068452
22         "BNIP3"    0.061850
869       "ARPC1B"    0.059913
1589        "DOLK"    0.052570
...            ...         ...
1023       "PYCR3"    0.000000
1024       "KANK3"    0.000000
1025       "KRT83"    0.000000
1026      "ZNF592"    0.000000
2999  "FAM13A-AS1"    0.000000

[3000 rows x 2 columns]
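One caveat worth flagging: impurity-based importances are computed on the training data and can be biased toward features with many distinct values. A common cross-check is permutation importance on held-out data. The sketch below illustrates the idea on synthetic data (shapes and parameters are illustrative, not our actual dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3 informative features out of 20
X, y = make_classification(n_samples=400, n_features=20, n_informative=3,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Impurity-based (train-time) vs. permutation (held-out) importances
imp_mdi = rf.feature_importances_
imp_perm = permutation_importance(rf, X_te, y_te, n_repeats=10,
                                  random_state=42).importances_mean

top_mdi = np.argsort(imp_mdi)[::-1][:3]
top_perm = np.argsort(imp_perm)[::-1][:3]
print(top_mdi, top_perm)
```

When the two rankings agree on the leading genes, as the biology-backed hits above suggest they would here, that strengthens confidence in the selection.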
A few comments are noteworthy at this point:
When applying the same methodology to the other datasets, such as that of HCC1806, we identify another set of significant genes: NDRG1, well known to be involved in stress responses, cell growth, and differentiation; it has been identified as a potential tumor suppressor gene and is often downregulated in several types of cancer[4]. DDIT4 is known to regulate the cellular response to stress and is often upregulated in response to hypoxia[5].
We now move to develop an ensemble learning framework. In order to prioritize those models that performed better we stratify the voting procedure in accordance with the mean accuracy exhibited by each model on the validation sets. This approach ensures a higher influence for the more accurate models within the ensemble structure.
best_models = [('log', best_log), ('svm', best_svm), ('rf', best_rf)]
accuracies = [acc_log, acc_svm, acc_rf]
ensemble = VotingClassifier(best_models, weights=accuracies)
ensemble.fit(X_train, y_train)
VotingClassifier(estimators=[('log',
                              LogisticRegression(C=1, penalty='l1',
                                                 solver='liblinear')),
                             ('svm',
                              SVC(C=0.1, degree=2, gamma=1, kernel='poly')),
                             ('rf',
                              RandomForestClassifier(max_leaf_nodes=20,
                                                     n_estimators=25,
                                                     random_state=42))],
                 weights=[1.0, 0.9682539682539683, 1.0])
predictions = ensemble.predict(X_test)
accuracy_score(y_test, predictions)
1.0
We decided to devote a dedicated section of our project to the investigation of the HCC1806 - DropSeq dataset. Indeed, this was the most challenging dataset we had to deal with. Its dimensionality (14682 × 3000) emerged as a substantial obstacle during model training. For the first time, we were confronted with the difficult decision of trading off accuracy for computational efficiency. Despite enduring lengthy waits for the optimization of hyperparameters, sometimes stretching into hours, we strove to seek even slight improvements in accuracy: starting from 90% accuracy with Random Forest, we managed to achieve 94% by finding the optimal combination of hyperparameters, although it took several hours (and a cool temperature in the room!). Ultimately, our search led us to a more intricate model capable of capturing relationships that were elusive to our previous models. This was the motivation behind the implementation of a small Neural Network for this particular dataset.
# engine='python' is required because the separator "\ " is treated as a regex
hcc = pd.read_csv("drive/MyDrive/Datasets/HCC1806_Filtered_Normalised_3000_Data_train.txt", sep="\ ", engine='python')
hcc = hcc.T
hcc['label'] = hcc.index.to_series().apply(lambda x: 'Normoxia' if 'Norm' in x else 'Hypoxia')
#Creating X and y
X = hcc.drop("label", axis = 1)
y = hcc["label"]
#Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(11011, 3000) (11011,) (3671, 3000) (3671,)
nn = MLPClassifier(random_state=42, batch_size='auto', max_iter=1000000, solver='sgd')
# alert: this cell will run for approximately 25 minutes
params_grid = {'hidden_layer_sizes': [(50,), (100,), (50,50), (100,100)], 'learning_rate_init': [0.1, 0.01, 0.001]}
nn_gs = GridSearchCV(nn, params_grid, cv=3, verbose=2)
nn_gs.fit(X_train, y_train)
# chosen MLP:
nn_best = MLPClassifier(random_state=42, batch_size='auto', max_iter=1000000, solver='sgd', hidden_layer_sizes=(100,), learning_rate_init=0.1)
nn_best.fit(X_train, y_train)
MLPClassifier(learning_rate_init=0.1, max_iter=1000000, random_state=42,
              solver='sgd')
ypred = nn_best.predict(X_test)
accuracy_score(y_test, ypred)
0.9591391991283029
We are quite content with this outcome: we achieved nearly 96% accuracy by adjusting only a few hyperparameters of the Neural Network, a process that took about 20 minutes. In contrast, the earlier 94% accuracy required several hours of tuning. This highlights not only the efficiency of the Neural Network model but also its efficacy in this specific application.
This model has been used, together with the usual others we have trained on HCC, to strive to correctly predict the anonymous dataset.
Our group project has led us through a rigorous exploration of gene expression data, with the ultimate objective of distinguishing between hypoxic and normoxic conditions within single cells. This endeavor encompassed a wide range of techniques and methodologies: general EDA, principal component analysis, clustering, and predictive models such as logistic regression, support vector machines, random forests, neural networks, and finally ensemble learning. We tried to make judicious decisions along the way, such as adopting randomized search over grid search on large datasets to optimize computational efficiency and time, and we highlighted the trade-offs between precision and recall.

An intriguing facet of our project was the extraction of feature importance, enabling us to identify genes that play a pivotal role under hypoxic conditions. This not only offers insights into the underlying biological processes but also holds potential for further research. The ensemble learning approach integrated the strengths of the various classifiers and reinforced prediction accuracy, with each model's vote weighted according to its performance; this strategy lent our model robustness and enhanced our confidence in its predictive power.

Through this project, we have not only tested our data analysis and machine learning skills but also gained insights into the intricate world of genetics and cancer biology. We hope our findings contribute to the larger conversation on cell conditions and cancer research.